Chapter 2 Data sources

2.1 Covid-19 Datasets

Xiaoyu is responsible for collecting the Covid-19-US data and the US 2020 election data. The Covid-19 data is downloaded from the JHU CSSE COVID-19 Dataset GitHub repository, which contains up-to-date Covid-19 data for the United States as well as the rest of the world. The repository aggregates data from numerous other sources, such as WHO, ECDC, US CDC, and BNO News, and is updated every day.

For this project, we use the covid_19_confirmed_US dataset and the covid_19_death_US dataset.
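As a minimal sketch of how these two datasets might be loaded, the snippet below assumes the files are the US time series CSVs in the repository's csse_covid_19_data/csse_covid_19_time_series folder. The URL, the file names, and the helper names `jhu_us_url` and `read_jhu_us` are assumptions for illustration and are not stated in the text.

```r
# Assumed location of the raw files in the JHU CSSE repository.
base_url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_time_series/"
)

# Hypothetical helper: build the URL of one US time series file.
jhu_us_url <- function(kind = c("confirmed", "deaths")) {
  kind <- match.arg(kind)
  paste0(base_url, "time_series_covid19_", kind, "_US.csv")
}

# Hypothetical helper: read one file into a data frame.
# Reading needs network access, so the calls below are left commented out.
read_jhu_us <- function(kind) {
  read.csv(jhu_us_url(kind), stringsAsFactors = FALSE)
}

# covid_19_confirmed_US <- read_jhu_us("confirmed")
# covid_19_death_US     <- read_jhu_us("deaths")
# head(covid_19_confirmed_US[, 1:12])
# dim(covid_19_confirmed_US)
```

With its default `check.names = TRUE`, `read.csv()` rewrites a header such as `1/22/20` into the syntactic name `X1.22.20`, which is the form the date columns take in this chapter.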

##        UID iso2 iso3 code3 FIPS  Admin2 Province_State Country_Region      Lat
## 1 84001001   US  USA   840 1001 Autauga        Alabama             US 32.53953
## 2 84001003   US  USA   840 1003 Baldwin        Alabama             US 30.72775
## 3 84001005   US  USA   840 1005 Barbour        Alabama             US 31.86826
## 4 84001007   US  USA   840 1007    Bibb        Alabama             US 32.99642
## 5 84001009   US  USA   840 1009  Blount        Alabama             US 33.98211
## 6 84001011   US  USA   840 1011 Bullock        Alabama             US 32.10031
##       Long_         Combined_Key X1.22.20
## 1 -86.64408 Autauga, Alabama, US        0
## 2 -87.72207 Baldwin, Alabama, US        0
## 3 -85.38713 Barbour, Alabama, US        0
## 4 -87.12511    Bibb, Alabama, US        0
## 5 -86.56791  Blount, Alabama, US        0
## 6 -85.71266 Bullock, Alabama, US        0
## [1] 3340  343

To summarize, the dataset contains 3340 rows and 343 columns. Some columns carry repetitive information (for example, iso2, iso3, code3, and Country_Region are constant in the rows shown) or ids that are not useful for our tasks, so we can consider dropping them. A benefit of this dataset is that it does not contain any missing values.
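The checks described here can be sketched as follows. The data frame `toy` is a two-row stand-in built from the first rows shown above, not the real dataset, and the choice of columns in `drop_cols` is an assumption about which columns count as redundant.

```r
# Toy stand-in with the same identifier columns as the real data.
toy <- data.frame(
  UID = c(84001001, 84001003),
  iso2 = "US", iso3 = "USA", code3 = 840,
  FIPS = c(1001, 1003),
  Admin2 = c("Autauga", "Baldwin"),
  Province_State = "Alabama",
  Country_Region = "US",
  Lat = c(32.53953, 30.72775),
  Long_ = c(-86.64408, -87.72207),
  Combined_Key = c("Autauga, Alabama, US", "Baldwin, Alabama, US"),
  X1.22.20 = c(0L, 0L),
  stringsAsFactors = FALSE
)

# Number of rows and columns.
dim(toy)

# Total number of missing values across all columns.
n_missing <- sum(is.na(toy))

# Assumed redundant columns: ids and fields that are constant for US data.
drop_cols <- c("UID", "iso2", "iso3", "code3", "Country_Region")
toy_clean <- toy[, !(names(toy) %in% drop_cols)]
names(toy_clean)
```

FIPS is kept in this sketch because it uniquely identifies a county, which makes it a natural key for joining with other county-level data.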

The first 11 columns contain information that identifies a specific county, and the remaining 332 columns each contain the total confirmed cases up until a specific date. We summarize the useful columns and their data types below:

FIPS: Federal Information Processing Standards code that uniquely identifies counties within the USA.

Admin2: County Name; string

Province_State: State Name; string

Lat and Long_: Latitude and longitude; float

Combined_Key: (County name, State name, US); string

X1.22.20: The total confirmed cases up until 1/22/2020; int

All the other columns have names in the same format as this last column, and each holds the total number of confirmed cases up until that specific day. The date columns are sorted in chronological order.
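Because the date is encoded in the column name, it can be recovered with base R. The sketch below uses three illustrative column names rather than the full dataset and checks that they are in chronological order.

```r
# Example column names in the style of the date columns.
date_cols <- c("X1.22.20", "X1.23.20", "X12.1.20")

# Strip the leading "X", then parse month.day.two-digit-year.
dates <- as.Date(sub("^X", "", date_cols), format = "%m.%d.%y")

# TRUE when the columns are already in chronological order.
in_order <- !is.unsorted(dates)
```

The same two lines could be applied to `names(df)[12:ncol(df)]` for a data frame `df` laid out as described above, assuming the date columns start at position 12.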

The covid_19_death_US dataset has the same format, except that the daily columns hold total deaths rather than total confirmed cases.