Background

This projects looks at vaccinations administered for Covid-19 over time, by manufacturer. Covid-19 has affected people worldwide and with media attention heavily focused upon vaccination uptake and potential benefits and risks of the varying vaccinations offered, I was interested to see how many vaccinations had actually been administered over time and by whom. The data is collected from [https://www.kaggle.com/code/fit4kz/covid-immunizations-analysis-in-r/data], and is a Kaggle data set, plotted and visualized in to varying interactive line graph formats. The data collected includes daily entries of total vaccinations, manufacturer and country of vaccination from 4/12/2020 to 8/3/2022. The following steps provide a guide to achieve this with code provided for use in RStudio. Various packages are used and may need to be installed before running the code, as may pathway specification to local drives for accessing the data and saving plots, scripts and code books. The package ‘here’ has been used within the code however to try to minimize these issues and set the working directory to wherever the user is working from.

Aims

The data set lends itself to investigate several outcomes. I was interested in finding total vaccinations administered worldwide to date, total vaccinations administered by country to see which country had the largest vaccination uptake, and finally total vaccinations administered over time by manufacturer, to see which vaccine was administered most worldwide.

Load and view data

First we need to load all packages used in RStudio, which can be installed using the install.packages() function.

#Load packages

library(tidyverse)
library(here)
library(ggplot2)
library(gganimate)
library(gifski)
library(png)
library(dplyr)
library(plotly)
library(highcharter)
library(codebook)
library(future)

Next we need to import the data.

#Load data

vaccinations <- read.csv(here("data", "vaccinations.csv"))

Now we should take a look at the data itself.

head(vaccinations)
##   location       date            vaccine total_vaccinations
## 1  Austria 2021-01-08    Johnson&Johnson                  0
## 2  Austria 2021-01-08            Moderna                  0
## 3  Austria 2021-01-08            Novavax                  0
## 4  Austria 2021-01-08 Oxford/AstraZeneca                  0
## 5  Austria 2021-01-08    Pfizer/BioNTech              31641
## 6  Austria 2021-01-15    Johnson&Johnson                  0

We can see there are four variables location, date, vaccine and total vaccinations. By using the code below we can also see the class of variable according to RStudio.

#View data set
glimpse(vaccinations)
## Rows: 31,127
## Columns: 4
## $ location           <chr> "Austria", "Austria", "Austria", "Austria", "Austri~
## $ date               <chr> "2021-01-08", "2021-01-08", "2021-01-08", "2021-01-~
## $ vaccine            <chr> "Johnson&Johnson", "Moderna", "Novavax", "Oxford/As~
## $ total_vaccinations <int> 0, 0, 0, 0, 31641, 0, 99, 0, 0, 117141, 0, 343, 0, ~

Here we can see that location, date and vaccine are classed as characters and total vaccinations is classed as interval data. We need to format some of the options and data so RStudio knows how to correctly interpret certain values within the data set.

Preparing the data

Specifying that the date variable is a date, ensures the variable is interpreted the right way by RStudio.

#Create a variable with the correct formats

vaccinations.formatted <- vaccinations %>%
  mutate(date = as.Date(date))

#Change limits for scientific as y axis values are so high and will show as scientific values without this code

vaccinations.formatted <- vaccinations.formatted %>%
  mutate(total_vaccinations = total_vaccinations / 1000000)

options(scipen = 1000000)

Total number of vaccinations is now shown in millions. Checking for duplicates in the data set ensures data, variables or trials are not read twice by RStudio and therefore skewing results.

#check for duplicates in the data set

anyDuplicated(vaccinations.formatted)

#There are no duplicates in this data set

Creating dataframes to answer aims

#Create a data frame that shows total vaccinations administered worldwide

vaccinations_worldwide <- vaccinations.formatted %>%
  summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE))

vaccinations_worldwide
##   total_immunizations
## 1            469784.6
#Create another data frame to find total vaccinations administered by country

vaccinations_by_country <- vaccinations.formatted %>%
  group_by(location) %>%
  summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(total_immunizations))

vaccinations_by_country
## # A tibble: 42 x 2
##    location       total_immunizations
##    <chr>                        <dbl>
##  1 European Union             174635.
##  2 United States              127696.
##  3 Germany                     35207.
##  4 France                      29067.
##  5 Italy                       25612.
##  6 South Korea                 19142.
##  7 Chile                        9275.
##  8 Peru                         9158.
##  9 Ecuador                      4599.
## 10 Ukraine                      4358.
## # ... with 32 more rows

Creating a plot

We now need to create the basic plot using the ggplot package.

#First create a data frame that has total vaccinations by manufacturer as the set variables

vaccinations_by_manufacturer <- vaccinations.formatted %>%
  drop_na() %>%
  group_by(date,
           vaccine) %>%
  summarise(total_vaccinations = sum(total_vaccinations), .groups = 'drop')

Now that we have our data frame we can create our plot using the ggplot and gganimate packages.

#Create a basic plot showing total vaccinations over time, grouped by vaccine
#Customize plot with geoms and scales
#Add axis labels and a theme
#Animate plot over time using gganimate

p1 <- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
  geom_line() +
  geom_point() + 
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations") +
  xlab("Date") +
  theme_classic() +
  transition_reveal(date)

#View plot 

p1

#Save plot as a gif 

anim_save(here("output", "p1gganimate.gif"))

We can also save this plot as a static picture by saving as a .png file.

#Save plot as a .png file

p1 <- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
  geom_line() +
  geom_point() + 
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations") +
  xlab("Date") +
  theme_classic()

ggsave(here("output", "p1ggplot.png"))

Adding plotly allows us to make the plot interactive, with a table that offers daily information on total vaccines administered by manufacturer when hovering the cursor over a data point on the plot. This allows for a more in depth look at the data shown, as well as still providing a clean and easy to read overview.

#Add plotly to create an interactive plot

p2 <- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
  geom_line() +
  geom_point() + 
  scale_y_continuous() +
  ggtitle("Vaccinations by manufacturer over time.") +
  ylab("Number of Vaccinations") +
  xlab("Date") +
  theme_classic()
  
p2 <- ggplotly(p2)
p2

Although the plot can be downloaded as a png when viewed, saving the graph as a HTML file creates a file with all needed JavaScript and CSS dependency files contained within it, that can be viewed online and is still interactive.

#Save plotly as HTML file  

htmlwidgets::saveWidget(p2, "output/p2plotly.html")

Highcharter

I like the interactive element of the plotly graph, and tidying and exploring more with this plot is quite possible in ggplot. Making use of another package called Highcharter, allows for further visual finesse, with differing themes and effects to choose from, increasing impact to your plot. Highcharter has an excellent tooltip with many customization functions, including a more refined interactive table which we will be using and coding for below. Highcharter is also written in pure JavaScript, making adding interactive charts to web sites or web applications easier.

#Plotting the data into an interactive line graph, using highcharter to add a refined finish

p3 <-  vaccinations_by_manufacturer %>%
  hchart('line', hcaes(x = date, y = total_vaccinations, group = vaccine)) %>%
  hc_title(text = 'Covid 19 vaccinations over time by manufacturer') %>%
  hc_subtitle(text = 'Number of administered vaccinations for Covid 19 over time and by each manufacturer') %>%
  hc_xAxis(title = list(text = 'Date')) %>%
  hc_yAxis(title = list(text = 'Total administered vaccinations in millions')) %>%
  hc_tooltip(crosshairs = TRUE, sort = TRUE, borderWidth = 6, table = TRUE, shared = TRUE) %>%
  hc_add_theme(hc_theme_ffx()) %>%
  hc_exporting(enabled = TRUE, filename = "plots/vaccinations-by-manufacturer-highcharter")
  
#view plot

p3

By defining that exporting is enabled allows the plot to always be exported when viewed, with the plot being saved as a HTML file in the same way as the previous plotly graph.

#Save highcharter plot

htmlwidgets::saveWidget(p3, "output/p3highcharter.html")

Codebook

Create a code book that uses metadata to give a technical overview of the data frame used to create the plots which specify the values RStudio attributes.

#Create a code book

codebook <- codebook(vaccinations_by_manufacturer)

#Save code book

htmltools::save_html(codebook, "output/codebook.html")

Discussion

I found this data easy to work with to answer my aims, and it was from a reliable source. Source data only contained the original data file, which was transferred into this project after unzipping.

I have found this project a steep learning curve, but enjoyable. I have learnt so much about the use of RStudio for visualizing data and presenting research in an open format online. I think one of the most challenging aspects of the project was adapting code to suit my needs. I had very clear ideas around what I wanted to do and functions I needed to perform to get there, but in the tailoring of functions I generally ran in to errors that required me to get a better understanding of the function I wanted to perform, before I could amend the code correctly and it would run error free.