Information About the Author
My name is Giorgi. I am an international student from the country of Georgia, double majoring in Mathematical Economics and Data Science at Gettysburg College. I am passionate about capturing, maintaining, processing, analyzing, and communicating data. I am particularly interested in one question: how can we use existing information to draw conclusions about the future?
I also enjoy reading and analyzing news about the economy, business, and financial markets in The Wall Street Journal, and I find it fascinating to apply concepts from my economics courses to real-world scenarios.
Currently, I am working in Dr. Johnson’s lab on a COVID-19 data science research project. Outside of work, as a member of Professor Johnson’s lab group, I take part in daily 40-minute Hacky Sack sessions, which are often characterized by a “sluggish” start and a “competitively” energetic finish. The activity helps me relax, have fun, and socialize with my peers.
Synopsis of the Research Project
Since the beginning of the pandemic, copious amounts of data have been collected on the spread of the COVID-19 virus throughout the world’s population. In this data science project, several Python libraries (NumPy, SciPy, Pandas, Matplotlib, and others) are used to numerically analyze publicly available demographic data. Specifically, the project uses data from sources such as usafacts.org, a central repository for COVID data in the US, and healthdata.gov, one of the most accessible online sources for health data in the United States.
The primary goal of this analysis is to create an interactive data visualization tool that will display temporal correlations between the rates, or peaks, of cases, hospitalizations, and deaths around different parts of the United States.
The research is mostly centered on coding in the Python programming language in Jupyter Notebook, a web-based interactive computing platform. Core Python data science libraries such as NumPy, Pandas, and Matplotlib are used, together with other packages, to carry out data mining, cleansing, filtering, manipulation, transformation, processing, and analysis. Advanced data visualizations, including interactive choropleth maps, are created with functions from libraries such as Folium, GeoPandas, Plotly, and PyDeck.
The research takes inspiration from the DataIsBeautiful subreddit, an open forum where individuals research, share, and discuss various types of data visualizations and analyses, providing detailed methodology, data, and source code in the descriptions of their posts.
Daily Work and Insight
Every day I use Jupyter Notebook to work with data. My tasks range from data extraction, collection, and processing to coding and generating interactive visualizations that present the data and the conclusions drawn from it.
During the first part of the research, I mainly focused on cleansing and manipulating the data structures for cases, hospitalizations, and deaths. Replacing NA values with zeroes and eliminating outliers caused by misreporting are two good examples of my tasks at the beginning of the research. Since the data for cases and deaths were both obtained from usafacts.org, the two datasets were structured in a similar fashion.
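The cleansing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table layout (one row per county, one column per date), the column name `countyFIPS`, and the helper `clean_cumulative` are assumptions made for the example. The outlier rule shown, forcing a cumulative series to never decrease, is only one simple way to handle misreported values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a cumulative table: one row per county,
# one column per date (this layout is an assumption for illustration).
raw = pd.DataFrame({
    "countyFIPS": [28001, 28003],
    "2020-03-01": [1.0, np.nan],
    "2020-03-02": [3.0, 2.0],
    "2020-03-03": [2.0, np.nan],  # cumulative count drops: likely misreported
    "2020-03-04": [5.0, 4.0],
})
date_cols = [c for c in raw.columns if c != "countyFIPS"]


def clean_cumulative(df, date_cols):
    """Fill missing values with zero, then remove dips in cumulative counts."""
    out = df.copy()
    values = out[date_cols].astype(float).fillna(0)
    # A cumulative count should never decrease, so take the running maximum
    # along the date axis (assumed rule, not necessarily the one used here).
    out[date_cols] = values.cummax(axis=1)
    return out


cleaned = clean_cumulative(raw, date_cols)
print(cleaned)
```

The running maximum has the side effect of repairing the zeroes introduced by `fillna(0)` in the middle of a series, since a zero after a positive count is treated as a dip.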
The following images display the organization of data for daily cumulative cases and deaths.
The hospitalization data (from healthdata.gov) came in a weekly cumulative structure. This file was more granular and required closer inspection than the files for cases and deaths. In addition to columns for hospitalizations by age group, it also contained other information, such as the geocoded hospital name, hospital address, and number of beds.
After the cleansing process, I aggregated the raw county-level data into useful state-level information. Because the hospitalization data was more granular, it required more combining and grouping.
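A county-to-state aggregation of this kind might look like the sketch below. The column names `countyFIPS` and `StateFIPS` and the sample values are assumptions for illustration; 28 and 48 are the state FIPS codes of Mississippi and Texas.

```python
import pandas as pd

# Hypothetical county-level table (values invented for the example).
county = pd.DataFrame({
    "countyFIPS": [28001, 28003, 48001],
    "StateFIPS": [28, 28, 48],
    "2020-03-01": [1, 2, 5],
    "2020-03-02": [3, 2, 8],
})
date_cols = [c for c in county.columns if c not in ("countyFIPS", "StateFIPS")]

# Sum every date column over the counties that share a state FIPS code.
state = county.groupby("StateFIPS")[date_cols].sum()
print(state)
```

For a more granular file, such as one with a row per hospital, the same `groupby(...).sum()` pattern applies after selecting the columns of interest.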
Data for cases, hospitalizations, and deaths were all given in cumulative form. For example, the number of cases on a specific date was the sum of the cases on the previous date and the new cases. To fix this, I converted the cumulative data to daily new data, so that each cell in a data frame displays only the new counts for that specific date.
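As a small illustration of this conversion, the sketch below differences a cumulative series to recover the new counts per date. The sample numbers are invented, and keeping the first cumulative value as the first day's count is an assumption about how the start of the series should be treated.

```python
import pandas as pd

# Hypothetical cumulative counts for one state.
dates = pd.date_range("2020-03-01", periods=5, freq="D")
cumulative = pd.Series([1, 3, 6, 6, 10], index=dates)

# New counts = today's cumulative total minus yesterday's.
# The first date has no predecessor, so its cumulative value is kept as is.
daily_new = cumulative.diff().fillna(cumulative.iloc[0])
print(daily_new)
```

A quick sanity check is that the new counts add back up to the last cumulative value.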
To standardize the data from the two different sources, I converted the daily cases and deaths to a weekly scale, which matched the weekly hospitalization data and allowed me to start comparing the dynamics of the datasets.
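One way to move from a daily to a weekly scale in Pandas is `resample`, sketched below. The choice of weeks ending on Sunday (the default for the `"W"` frequency) is an assumption; the week boundaries would need to match those of the hospitalization file.

```python
import pandas as pd

# Hypothetical daily new counts for 14 days, starting on a Monday.
dates = pd.date_range("2020-03-02", periods=14, freq="D")
daily_new = pd.Series(range(1, 15), index=dates)

# Sum the daily values within each week; "W" labels each bin by its Sunday.
weekly = daily_new.resample("W").sum()
print(weekly)
```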
Following this, I calculated the cases, hospitalizations, and deaths in each state per capita by dividing the data by the population of each state.
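The per capita step amounts to a column-wise division, as in this sketch. The populations below are round illustrative numbers rather than census figures, and the extra scaling to counts per 100,000 residents is an optional convenience that the description above does not mention.

```python
import pandas as pd

# Hypothetical weekly counts, one column per state FIPS code.
index = pd.to_datetime(["2020-03-08", "2020-03-15"])
weekly = pd.DataFrame({28: [300, 600], 48: [2900, 5800]}, index=index)

# Illustrative populations only (not real census data).
population = pd.Series({28: 3_000_000, 48: 29_000_000})

# Divide each state's column by that state's population.
per_capita = weekly.div(population, axis="columns")

# Optional: express as counts per 100,000 residents for readability.
per_100k = per_capita * 100_000
print(per_100k)
```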
As soon as this step was finished, Professor Johnson and I decided to make the moving-average window a variable, so that future users can specify the number of days X over which the moving average is computed and view the final data from different perspectives. I therefore wrote Python code that lets the user enter the state FIPS code (a unique number for each state) and the moving-average window by which the data is processed.
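A variable moving average of this kind could be written along the following lines. The function name `smoothed_state_series` is hypothetical, the window is counted in rows of the table (days or weeks, depending on the scale of the data), and the use of a trailing window with partial averages at the start is an assumption.

```python
import pandas as pd


def smoothed_state_series(table, state_fips, window):
    """Return the moving average of one state's column.

    table      : DataFrame with one column per state FIPS code
    state_fips : FIPS code selecting the state
    window     : number of rows averaged together (must be >= 1)
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # min_periods=1 averages whatever is available at the start of the series
    # instead of producing missing values there.
    return table[state_fips].rolling(window=window, min_periods=1).mean()


# Hypothetical data for Mississippi (FIPS 28).
index = pd.date_range("2020-03-01", periods=5, freq="D")
table = pd.DataFrame({28: [1, 2, 3, 4, 5]}, index=index)
smoothed = smoothed_state_series(table, state_fips=28, window=3)
print(smoothed)
```

Passing the FIPS code and the window as function arguments, rather than reading them interactively, makes the same code easy to call in a loop over all states.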
After this step, I made slight adjustments to the code so that it combined all three parameters (cases, hospitalizations, and deaths) and generated a final array containing fully prepared data for each. By aggregating the results for all states, I identified the dates of the different peaks on a national scale. I then created a “range of inspection” for each peak, an interval with the date of the national peak as its midpoint. Next, I wrote code that selects the maximum values (peaks) of cases, hospitalizations, and deaths within the set interval for each state. Following this, I began calculating the approximate time difference (Δt) between the three parameters for the first, second, and third peaks.
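The peak selection and the Δt calculation can be sketched as below. The helper names `peak_date_in_window` and `delta_t_days`, the 28-day half-width, and the sample series are all assumptions for illustration. The sign convention (earlier parameter minus later parameter) is chosen to agree with the note further down, where the usual order of cases, hospitalizations, and deaths gives a negative Δt.

```python
import pandas as pd


def peak_date_in_window(series, center, half_width_days):
    """Date of the maximum value within center +/- half_width_days."""
    start = center - pd.Timedelta(days=half_width_days)
    end = center + pd.Timedelta(days=half_width_days)
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    return series.loc[start:end].idxmax()


def delta_t_days(first_peak, second_peak):
    """Time difference in days; negative when first_peak comes earlier."""
    return (first_peak - second_peak).days


# Hypothetical weekly per capita series for one state.
idx = pd.date_range("2020-03-08", periods=10, freq="W")
cases = pd.Series([1, 2, 5, 9, 6, 3, 2, 1, 1, 1], index=idx)
hosp = pd.Series([0, 1, 2, 4, 7, 5, 3, 2, 1, 1], index=idx)
deaths = pd.Series([0, 0, 1, 1, 2, 3, 6, 4, 2, 1], index=idx)

# Assumed date of a national peak, used as the midpoint of the interval.
national_peak = pd.Timestamp("2020-04-05")
peaks = {
    name: peak_date_in_window(s, national_peak, half_width_days=28)
    for name, s in {"cases": cases, "hosp": hosp, "deaths": deaths}.items()
}

dt_cases_hosp = delta_t_days(peaks["cases"], peaks["hosp"])
dt_cases_deaths = delta_t_days(peaks["cases"], peaks["deaths"])
dt_hosp_deaths = delta_t_days(peaks["hosp"], peaks["deaths"])
print(peaks, dt_cases_hosp, dt_cases_deaths, dt_hosp_deaths)
```

Restricting the search to an interval around each national peak keeps a state's first-wave maximum from being confused with a larger later wave.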
Finally, I created a code structure that allows the user to enter the reference numbers (state FIPS codes) for two states and the respective moving-average windows. Running the code generates a figure comparing the plots of cases, hospitalizations, and deaths for the two states.
The image below displays figures comparing cases, hospitalizations, and deaths for the states of Mississippi and Texas, along with data frames that include the approximate time differences between the parameters.
Important note: normally, peaks occur in the following order: cases, hospitalizations, and deaths. In that situation, the time difference Δt has a negative sign. If the order is violated in any way, the sign of Δt is positive, as observed when comparing the second peaks of cases and deaths in the second data frame.
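A two-state comparison figure of this sort might be produced as in the following sketch. The function `compare_states`, the layout of one subplot per parameter, and the sample data are hypothetical and stand in for the project's actual plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def compare_states(metrics, fips_a, fips_b, window_a, window_b):
    """Plot each metric for two states, one subplot per metric.

    metrics : dict mapping a metric name to a DataFrame with one
              column per state FIPS code (assumed layout)
    """
    fig, axes = plt.subplots(len(metrics), 1, sharex=True, squeeze=False,
                             figsize=(8, 3 * len(metrics)))
    for ax, (name, table) in zip(axes[:, 0], metrics.items()):
        for fips, window in ((fips_a, window_a), (fips_b, window_b)):
            # Smooth each state's series with its own moving-average window.
            smoothed = table[fips].rolling(window, min_periods=1).mean()
            ax.plot(smoothed.index, smoothed.values,
                    label=f"FIPS {fips} (window={window})")
        ax.set_ylabel(name)
        ax.legend()
    fig.tight_layout()
    return fig


# Hypothetical weekly data for Mississippi (28) and Texas (48).
idx = pd.date_range("2020-03-08", periods=6, freq="W")
metrics = {
    "cases": pd.DataFrame({28: [1, 3, 6, 4, 2, 1], 48: [2, 4, 5, 7, 3, 1]}, index=idx),
    "hospitalizations": pd.DataFrame({28: [0, 1, 3, 4, 2, 1], 48: [0, 2, 3, 5, 4, 2]}, index=idx),
    "deaths": pd.DataFrame({28: [0, 0, 1, 2, 3, 1], 48: [0, 0, 1, 2, 4, 2]}, index=idx),
}
fig = compare_states(metrics, 28, 48, window_a=2, window_b=3)
fig.savefig("state_comparison.png")
```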
What is Next in the Research?
The purely quantitative part of the research is finished. As a checkpoint, I have working Python code that generates plots and arrays of cases, hospitalizations, and deaths for any two states the user wants to compare in terms of the temporal difference between peaks. The remaining research time will be spent on creating a user-friendly interactive visualization in the form of a US map. To this end, I will use GeoJSON files and various functions from the Folium, GeoPandas, Plotly, and PyDeck libraries. In addition, I will keep reviewing the data-processing code to make sure that everything runs well and produces the intended results.
At the conclusion of the research, I intend to publish the visualization so that individuals from different fields who are interested in such analysis can read, share, and discuss the work. I am confident that, besides serving as an interactive data visualization tool, the research product will act as an instrument of observation and inquiry, raising new questions and inspiring further research in the field. I expect it to motivate new data analytics projects investigating how demographic factors such as political affiliation, socioeconomic status, vaccination rate, race, and gender affect the temporal difference between the peaks of cases, hospitalizations, and deaths across different regions of the United States.
- Data for Cases, Deaths, and Population – https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
- Data for Hospitalizations – https://healthdata.gov/dataset/COVID-19-Community-Profile-Report-County-Level/di4u-7yu6/data
- DataIsBeautiful subreddit – https://www.reddit.com/r/dataisbeautiful/