Background
This projects looks at vaccinations administered for Covid-19 over time, by manufacturer. Covid-19 has affected people worldwide and with media attention heavily focused upon vaccination uptake and potential benefits and risks of the varying vaccinations offered, I was interested to see how many vaccinations had actually been administered over time and by whom. The data is collected from [https://www.kaggle.com/code/fit4kz/covid-immunizations-analysis-in-r/data], and is a Kaggle data set, plotted and visualized in to varying interactive line graph formats. The data collected includes daily entries of total vaccinations, manufacturer and country of vaccination from 4/12/2020 to 8/3/2022. The following steps provide a guide to achieve this with code provided for use in RStudio. Various packages are used and may need to be installed before running the code, as may pathway specification to local drives for accessing the data and saving plots, scripts and code books. The package ‘here’ has been used within the code however to try to minimize these issues and set the working directory to wherever the user is working from.
Aims
The data set lends itself to investigate several outcomes. I was interested in finding total vaccinations administered worldwide to date, total vaccinations administered by country to see which country had the largest vaccination uptake, and finally total vaccinations administered over time by manufacturer, to see which vaccine was administered most worldwide.
Load and view data
First we need to load all packages used in RStudio, which can be installed using the install.packages() function.
#Load packages
library(tidyverse)
library(here)
library(ggplot2)
library(gganimate)
library(gifski)
library(png)
library(dplyr)
library(plotly)
library(highcharter)
library(codebook)
library(future)
Next we need to import the data.
#Load data
<- read.csv(here("data", "vaccinations.csv")) vaccinations
Now we should take a look at the data itself.
head(vaccinations)
## location date vaccine total_vaccinations
## 1 Austria 2021-01-08 Johnson&Johnson 0
## 2 Austria 2021-01-08 Moderna 0
## 3 Austria 2021-01-08 Novavax 0
## 4 Austria 2021-01-08 Oxford/AstraZeneca 0
## 5 Austria 2021-01-08 Pfizer/BioNTech 31641
## 6 Austria 2021-01-15 Johnson&Johnson 0
We can see there are four variables location, date, vaccine and total vaccinations. By using the code below we can also see the class of variable according to RStudio.
#View data set
glimpse(vaccinations)
## Rows: 31,127
## Columns: 4
## $ location <chr> "Austria", "Austria", "Austria", "Austria", "Austri~
## $ date <chr> "2021-01-08", "2021-01-08", "2021-01-08", "2021-01-~
## $ vaccine <chr> "Johnson&Johnson", "Moderna", "Novavax", "Oxford/As~
## $ total_vaccinations <int> 0, 0, 0, 0, 31641, 0, 99, 0, 0, 117141, 0, 343, 0, ~
Here we can see that location, date and vaccine are classed as characters and total vaccinations is classed as interval data. We need to format some of the options and data so RStudio knows how to correctly interpret certain values within the data set.
Preparing the data
Specifying that the date variable is a date, ensures the variable is interpreted the right way by RStudio.
#Create a variable with the correct formats
<- vaccinations %>%
vaccinations.formatted mutate(date = as.Date(date))
#Change limits for scientific as y axis values are so high and will show as scientific values without this code
<- vaccinations.formatted %>%
vaccinations.formatted mutate(total_vaccinations = total_vaccinations / 1000000)
options(scipen = 1000000)
Total number of vaccinations is now shown in millions. Checking for duplicates in the data set ensures data, variables or trials are not read twice by RStudio and therefore skewing results.
#check for duplicates in the data set
anyDuplicated(vaccinations.formatted)
#There are no duplicates in this data set
Creating dataframes to answer aims
#Create a data frame that shows total vaccinations administered worldwide
<- vaccinations.formatted %>%
vaccinations_worldwide summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE))
vaccinations_worldwide
## total_immunizations
## 1 469784.6
#Create another data frame to find total vaccinations administered by country
<- vaccinations.formatted %>%
vaccinations_by_country group_by(location) %>%
summarise(total_immunizations = sum(total_vaccinations, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(total_immunizations))
vaccinations_by_country
## # A tibble: 42 x 2
## location total_immunizations
## <chr> <dbl>
## 1 European Union 174635.
## 2 United States 127696.
## 3 Germany 35207.
## 4 France 29067.
## 5 Italy 25612.
## 6 South Korea 19142.
## 7 Chile 9275.
## 8 Peru 9158.
## 9 Ecuador 4599.
## 10 Ukraine 4358.
## # ... with 32 more rows
Creating a plot
We now need to create the basic plot using the ggplot package.
#First create a data frame that has total vaccinations by manufacturer as the set variables
<- vaccinations.formatted %>%
vaccinations_by_manufacturer drop_na() %>%
group_by(date,
%>%
vaccine) summarise(total_vaccinations = sum(total_vaccinations), .groups = 'drop')
Now that we have our data frame we can create our plot using the ggplot and gganimate packages.
#Create a basic plot showing total vaccinations over time, grouped by vaccine
#Customize plot with geoms and scales
#Add axis labels and a theme
#Animate plot over time using gganimate
<- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
p1 geom_line() +
geom_point() +
scale_y_continuous() +
ggtitle("Vaccinations by manufacturer over time.") +
ylab("Number of Vaccinations") +
xlab("Date") +
theme_classic() +
transition_reveal(date)
#View plot
p1
#Save plot as a gif
anim_save(here("output", "p1gganimate.gif"))
We can also save this plot as a static picture by saving as a .png file.
#Save plot as a .png file
<- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
p1 geom_line() +
geom_point() +
scale_y_continuous() +
ggtitle("Vaccinations by manufacturer over time.") +
ylab("Number of Vaccinations") +
xlab("Date") +
theme_classic()
ggsave(here("output", "p1ggplot.png"))
Adding plotly allows us to make the plot interactive, with a table that offers daily information on total vaccines administered by manufacturer when hovering the cursor over a data point on the plot. This allows for a more in depth look at the data shown, as well as still providing a clean and easy to read overview.
#Add plotly to create an interactive plot
<- ggplot(vaccinations_by_manufacturer, aes(x=date, y=total_vaccinations, group=vaccine, color=vaccine)) +
p2 geom_line() +
geom_point() +
scale_y_continuous() +
ggtitle("Vaccinations by manufacturer over time.") +
ylab("Number of Vaccinations") +
xlab("Date") +
theme_classic()
<- ggplotly(p2)
p2 p2
Although the plot can be downloaded as a png when viewed, saving the graph as a HTML file creates a file with all needed JavaScript and CSS dependency files contained within it, that can be viewed online and is still interactive.
#Save plotly as HTML file
::saveWidget(p2, "output/p2plotly.html") htmlwidgets
Highcharter
I like the interactive element of the plotly graph, and tidying and exploring more with this plot is quite possible in ggplot. Making use of another package called Highcharter, allows for further visual finesse, with differing themes and effects to choose from, increasing impact to your plot. Highcharter has an excellent tooltip with many customization functions, including a more refined interactive table which we will be using and coding for below. Highcharter is also written in pure JavaScript, making adding interactive charts to web sites or web applications easier.
#Plotting the data into an interactive line graph, using highcharter to add a refined finish
<- vaccinations_by_manufacturer %>%
p3 hchart('line', hcaes(x = date, y = total_vaccinations, group = vaccine)) %>%
hc_title(text = 'Covid 19 vaccinations over time by manufacturer') %>%
hc_subtitle(text = 'Number of administered vaccinations for Covid 19 over time and by each manufacturer') %>%
hc_xAxis(title = list(text = 'Date')) %>%
hc_yAxis(title = list(text = 'Total administered vaccinations in millions')) %>%
hc_tooltip(crosshairs = TRUE, sort = TRUE, borderWidth = 6, table = TRUE, shared = TRUE) %>%
hc_add_theme(hc_theme_ffx()) %>%
hc_exporting(enabled = TRUE, filename = "plots/vaccinations-by-manufacturer-highcharter")
#view plot
p3
By defining that exporting is enabled allows the plot to always be exported when viewed, with the plot being saved as a HTML file in the same way as the previous plotly graph.
#Save highcharter plot
::saveWidget(p3, "output/p3highcharter.html") htmlwidgets
Codebook
Create a code book that uses metadata to give a technical overview of the data frame used to create the plots which specify the values RStudio attributes.
#Create a code book
<- codebook(vaccinations_by_manufacturer)
codebook
#Save code book
::save_html(codebook, "output/codebook.html") htmltools
Discussion
I found this data easy to work with to answer my aims, and it was from a reliable source. Source data only contained the original data file, which was transferred into this project after unzipping.
I have found this project a steep learning curve, but enjoyable. I have learnt so much about the use of RStudio for visualizing data and presenting research in an open format online. I think one of the most challenging aspects of the project was adapting code to suit my needs. I had very clear ideas around what I wanted to do and functions I needed to perform to get there, but in the tailoring of functions I generally ran in to errors that required me to get a better understanding of the function I wanted to perform, before I could amend the code correctly and it would run error free.