Chapter 2 Data sources
2.1 Covid-19 Datasets
Xiaoyu is responsible for collecting the Covid-19 US data and the 2020 US election data. The Covid-19 data is downloaded from the JHU CSSE COVID-19 Dataset GitHub repository, which contains up-to-date Covid-19 data for the United States and the rest of the world. The repository aggregates data from numerous sources, such as the WHO, ECDC, US CDC, and BNO News, and is updated every day.
For this project, we use the covid_19_confirmed_US dataset and the covid_19_death_US dataset. The first six rows and the dimensions of the confirmed-cases dataset are shown below.
## UID iso2 iso3 code3 FIPS Admin2 Province_State Country_Region Lat
## 1 84001001 US USA 840 1001 Autauga Alabama US 32.53953
## 2 84001003 US USA 840 1003 Baldwin Alabama US 30.72775
## 3 84001005 US USA 840 1005 Barbour Alabama US 31.86826
## 4 84001007 US USA 840 1007 Bibb Alabama US 32.99642
## 5 84001009 US USA 840 1009 Blount Alabama US 33.98211
## 6 84001011 US USA 840 1011 Bullock Alabama US 32.10031
## Long_ Combined_Key X1.22.20
## 1 -86.64408 Autauga, Alabama, US 0
## 2 -87.72207 Baldwin, Alabama, US 0
## 3 -85.38713 Barbour, Alabama, US 0
## 4 -87.12511 Bibb, Alabama, US 0
## 5 -86.56791 Blount, Alabama, US 0
## 6 -85.71266 Bullock, Alabama, US 0
## [1] 3340 343
To summarize, the dataset contains 3340 rows and 343 columns. Some columns carry repetitive information or ids that are not useful for our tasks, so we can consider dropping them. Conveniently, the dataset does not contain any missing values.
The first 11 columns contain information that identifies a specific county, and all the other columns contain the total confirmed cases up to a specific date. We summarize the useful columns and their data types below:
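The missing-value check and the column clean-up described above can be sketched in base R as follows. This is a minimal illustration on a two-row stand-in for the real table: the `X1.23.20` values are made up, and the choice of columns in `drop_cols` is our assumption about which ids are redundant (the country-level fields are constant, and `UID` appears to embed the FIPS code), not a decision stated in the text.

```r
# Toy stand-in for the confirmed-cases table (two counties, two dates).
# The X1.23.20 values are illustrative only.
confirmed <- data.frame(
  UID = c(84001001, 84001003),
  iso2 = c("US", "US"),
  iso3 = c("USA", "USA"),
  code3 = c(840, 840),
  FIPS = c(1001, 1003),
  Admin2 = c("Autauga", "Baldwin"),
  Province_State = c("Alabama", "Alabama"),
  Country_Region = c("US", "US"),
  Lat = c(32.53953, 30.72775),
  Long_ = c(-86.64408, -87.72207),
  Combined_Key = c("Autauga, Alabama, US", "Baldwin, Alabama, US"),
  X1.22.20 = c(0L, 0L),
  X1.23.20 = c(0L, 0L),
  stringsAsFactors = FALSE
)

# TRUE if any cell in the table is NA.
has_missing <- anyNA(confirmed)

# Assumed set of redundant columns (hypothetical choice for illustration).
drop_cols <- c("UID", "iso2", "iso3", "code3", "Country_Region")
confirmed_slim <- confirmed[, !(names(confirmed) %in% drop_cols)]

# Rows and columns remaining after the drop.
dim(confirmed_slim)
```

Selecting by name rather than by position keeps the step valid as new date columns are appended to the source file each day.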
FIPS: Federal Information Processing Standards code that uniquely identifies counties within the USA.
Admin2: County Name; string
Province_State: State Name; string
Lat and Long_: Latitude and longitude; float
Combined_Key: (County name, State name, US); string
X1.22.20: The total confirmed cases up until 1/22/2020; int
All the other columns have names of the same form as this last column, and each holds the total number of confirmed cases up to that specific day. The columns are sorted in date order.
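Because the date is encoded in the column name and the counts are cumulative, two small steps are usually needed before analysis: turning names such as `X1.22.20` into dates, and differencing the running totals to get daily new cases. The sketch below assumes the raw header is of the form `1/22/20`, which R's default name checking turns into `X1.22.20`. The counts in `cumulative` are made-up numbers for a hypothetical county.

```r
# R's default name checking turns a header like "1/22/20" into "X1.22.20".
mangled <- make.names("1/22/20")

# Date column names as they appear in the data frame.
date_cols <- c("X1.22.20", "X1.23.20", "X1.24.20")

# Strip the leading "X" and parse month.day.two-digit-year.
dates <- as.Date(sub("^X", "", date_cols), format = "%m.%d.%y")

# Cumulative confirmed cases for one hypothetical county (made-up values).
cumulative <- c(0L, 2L, 5L)

# Daily new cases are the day-over-day differences of the cumulative series;
# the first day keeps its own value.
daily_new <- c(cumulative[1], diff(cumulative))

data.frame(date = dates, cumulative = cumulative, daily_new = daily_new)
```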
The covid_19_death_US dataset has the same format except that the numbers representing each day are total deaths rather than confirmed cases.
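Since the two datasets share the same layout, one helper can reshape either of them from wide (one column per date) to long (one row per county and date), after which they can be joined on county and date. The following is a base-R sketch under that assumption. The helper name `to_long`, the toy tables, and their counts are hypothetical and only serve to illustrate the idea.

```r
# Reshape a wide table (one column per date) into long format.
# `to_long` is a hypothetical helper, not part of any package.
to_long <- function(wide, value_name) {
  # Date columns look like X<month>.<day>.<year>.
  date_cols <- grep("^X[0-9]+\\.[0-9]+\\.[0-9]+$", names(wide), value = TRUE)
  dates <- as.Date(sub("^X", "", date_cols), format = "%m.%d.%y")
  long <- data.frame(
    # unlist() stacks the date columns one after another, so FIPS repeats
    # as a whole block and each date repeats once per county.
    FIPS = rep(wide$FIPS, times = length(date_cols)),
    date = rep(dates, each = nrow(wide)),
    value = unlist(wide[date_cols], use.names = FALSE)
  )
  names(long)[3] <- value_name
  long
}

# Toy tables with made-up counts for two counties and two dates.
confirmed <- data.frame(FIPS = c(1001, 1003),
                        X1.22.20 = c(0L, 0L), X1.23.20 = c(1L, 3L))
deaths <- data.frame(FIPS = c(1001, 1003),
                     X1.22.20 = c(0L, 0L), X1.23.20 = c(0L, 1L))

# One row per county and date, with both measures side by side.
both <- merge(to_long(confirmed, "confirmed"),
              to_long(deaths, "deaths"),
              by = c("FIPS", "date"))
both
```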