The Clock is "Ticking" to Address Lyme Disease in the U.S.

Background and Motivation
$\;\;\;\;\;\;$ Salar is a 3rd year Mathematics major with a Statistics emphasis and a minor in Political Science. Because of his career interest in the field of Public Health, Salar wanted to analyze data about a specific disease. Salar completed an internship in the Office of Health Care Statistics at the Utah Department of Health (UDOH) in the Fall of 2021 and during this internship Salar learned a lot about the ethical considerations as well as federal and state laws that govern data about a patient’s health. He also learned about the software, techniques, and data analysis tools that UDOH uses. Salar sees this project as a great opportunity to execute the skills he learned from his internship at UDOH. Salar also has experience analyzing datasets using the programming language R and he’s excited to carry over those skills to this project.
$\;\;\;\;\;\;$ Salar had an interest in this project for several reasons. Because Salar wants to get a Master’s Degree in Biostatistics and eventually work as a Biostatistician at a government agency or private company, he sees the completion of this project as a great addition to his future curriculum vitae. Salar also believes that this project will give him a good taste of the data analysis that he’ll be doing in his future career. The news and data analysis surrounding COVID-19 is what sparked Salar’s interest in the field of Public Health.
$\;\;\;\;\;\;$ Ian is majoring in Mathematics with a Statistics emphasis and is interested in data, analyzing data, and using the analysis to improve different aspects of the world. Growing up, Ian was most interested in sports and how numbers impacted the game such as turnover %, ppg, dreb etc. Ian loves comparing two teams to see which is better in various aspects of the sport. No matter what sport, Ian is interested in the numbers behind the sport. Ian always knew that there was a huge data industry in health with different types of diseases especially when COVID-19 came around. That was the first time that Ian really started playing with data, testing the effects and trying to predict certain things. That was when Ian realized that there are infinitely many ways to compare variables together as long as there is data. Ian has never really understood Lyme Disease so he thought it would be interesting to look into it and experiment with data around it. Ian’s approach was more of a “why not” perspective and he wanted to try something new instead of sports.
$\;\;\;\;\;\;$ Nicolas was generally pretty open about topics for the project. Public health is an area that is super interesting to him as it corresponds with one of his majors, Political Science, in some capacities. Nicolas has worked with a number of different programs for analyzing data and has a general understanding of how to do it from his previous job with a car dealership. In general this is a good precursor to learn more about one potential career he’s interested in.
$\;\;\;\;\;\;$ When it comes to health data, Nicolas has been pretty involved in looking at it over the course of the COVID-19 pandemic. During the pandemic, Nicolas became pretty adept at reading raw health data, as there was simply so much coming out at that time. Nicolas thinks public health is very important in the grand scheme of things, so looking at the distribution of Lyme disease across people and the factors that affect it is very important.
Project Objectives
$\;\;\;\;\;\;$ The primary questions we are trying to answer in our project are (1) is Lyme disease more common in urban, suburban, or rural counties, (2) which regions of the U.S. have more Lyme disease cases at the county level relative to the population of those counties, (3) does the population density of deer increase the risk of Lyme disease, and (4) can demographic data help us predict where Lyme disease is more common? We wanted to use our datasets to address each of these questions so that we could potentially warn people about demographic, wildlife, and environmental risk factors. We hope that spreading this information would lower both the risk of the disease and the number of cases.
Background
$\;\;\;\;\;\;$ Lyme disease is a vector-borne disease that spreads through tick bites. It is the most common vector-borne disease in the U.S. The disease is caused by a bacterium and is transmitted to humans by blacklegged ticks, otherwise known as deer ticks. Symptoms include fever, fatigue, and, most commonly, a skin rash. Lyme disease can usually be treated with antibiotics; however, if left untreated, the infection may spread to the heart and nervous system. This can lead to arthritis, nerve paralysis, and heart irregularity. According to an article titled "Challenges in Predicting Lyme Disease Risk," "Since the mid-1990s, the number of US counties where the primary tick vector, Ixodes scapularis, is documented has increased, as has the number of counties that have a high incidence of human Lyme disease" (Eisen and Kugeler). Because Lyme disease is becoming more common in the U.S. as the deer tick becomes more widespread, we hope to educate the public about the risk factors that can make someone more susceptible to contracting Lyme disease. Common techniques for presenting Lyme disease data include geospatial analysis, bar charts, and line graphs.
Data and Data Description
$\;\;\;\;\;\;$ Our data consists of three datasets: Lyme disease case counts from the U.S. Centers for Disease Control and Prevention (CDC), deer density estimates from the Data Repository for the University of Minnesota, and a demographic census report compiled from various U.S. government sources including the CDC, the U.S. Department of Labor, and the U.S. Department of Agriculture. Our main dataset, the Lyme disease dataset, consists of Lyme disease case counts for every county and county equivalent in the U.S. from 1992 to 2011, reported in four multi-year intervals (1992-1996, 1997-2001, 2002-2006, and 2007-2011). In other words, case counts for each individual year between 1992 and 2011 are not included. The second dataset consists of deer density throughout the eastern United States (Alabama, Arkansas, Connecticut, Delaware, Florida, Georgia, Illinois, Indiana, Iowa, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, New Hampshire, New Jersey, New York, North Carolina, Ohio, Pennsylvania, Rhode Island, South Carolina, Tennessee, Vermont, Virginia, West Virginia, and Wisconsin). This dataset was a QDMA spatial map that depicted the density of deer per square mile, coded by color. Each county was associated with a color indicating its deer density: White = rare, absent, or unknown; Green = less than 15 deer per square mile; Yellow = 15-30 deer per square mile; Orange = 30-40 deer per square mile; Red = >45 deer per square mile. This information was collected by each state’s wildlife agency from 2001 to 2005. The spatial map and dataset were collected, digitized, and processed by the Quality Deer Management Association, the University of Minnesota Forest Ecosystem Health Lab, and the U.S. Department of Agriculture. The data was produced to provide information on the status and trends of forest and wildlife health in this region.
The data itself came as a single GIS shapefile that was only accessible through the GIS software ArcGIS Pro. Lastly, the census dataset is OpenIntro’s compilation of data from different census reports and reports from other government sources. This data covers 2008-2017, but we primarily use the data from 2008-2010. It contains demographic measures such as race and average household income. This dataset is also organized by county, so it is easy to compare with our Lyme disease dataset. The county dataset also has information on some territories that are not in our Lyme disease dataset, such as Puerto Rico, and we treat this as extraneous information.
Data Processing
$\;\;\;\;\;\;$ The first step in processing the Lyme disease case counts dataset was putting zeroes in the blank cells of the data. This was easy to do by following a short series of steps on the CSV file in Excel. After completing this step, we loaded the CSV file into a Pandas dataframe and added an extra column to it. This extra column combines the case counts from four other columns, one for each reporting interval between 1992 and 2011 (1992 to 1996, 1997 to 2001, 2002 to 2006, and 2007 to 2011). We decided to merge the four columns because we wanted to analyze the Lyme disease case counts over the whole 1992 to 2011 period rather than interval by interval. Analyzing each interval separately would have been more work, and it would have made it harder for viewers of the project to get the big picture of what we were trying to convey with our data.
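The zero-filling and column-merging steps described above can be sketched in pandas as follows. The column names and the tiny inline table are assumptions made for illustration, not the actual headers of the CDC file. The sketch also fills the blanks in pandas rather than in Excel, and it assumes the merged column is the sum of the four interval columns.

```python
import io
import pandas as pd

# Hypothetical miniature of the case-count file; blank cells stand for missing counts.
raw_csv = io.StringIO(
    "County,State,Cases1992_1996,Cases1997_2001,Cases2002_2006,Cases2007_2011\n"
    "Alpha,PA,10,,25,40\n"
    "Beta,UT,,,1,\n"
)
lyme = pd.read_csv(raw_csv)

interval_cols = ["Cases1992_1996", "Cases1997_2001", "Cases2002_2006", "Cases2007_2011"]

# Replace blank (NaN) cells with zero, then store the counts as integers.
lyme[interval_cols] = lyme[interval_cols].fillna(0).astype(int)

# Add one column holding the total for the whole 1992-2011 period (assumed to be a sum).
lyme["Cases1992_2011"] = lyme[interval_cols].sum(axis=1)

print(lyme[["County", "Cases1992_2011"]])
```

Filling the blanks in code instead of by hand keeps the step repeatable if the source file is downloaded again.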
$\;\;\;\;\;\;$ For the demographic census report, we took this one step further by combining it with the Lyme disease dataset, as the county entries in the Lyme disease data and the census data matched up for the most part. We then filled in values that were empty or NaN (not a number) and removed extraneous columns from the combined dataset.
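A minimal sketch of this combining and cleaning step is shown below. The key column `fips`, the other column names, and all values are hypothetical. The report does not say what the missing values were filled with, so the column median used here is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two county-level tables; "fips" is an assumed key column.
lyme = pd.DataFrame({"fips": [1001, 1003, 1005], "cases": [12, 0, 7]})
census = pd.DataFrame({
    "fips": [1001, 1003, 1005, 72001],         # 72001 mimics a territory absent from the Lyme data
    "pop2010": [54571, 182265, 27457, 19483],
    "median_household_income": [53255, np.nan, 33219, 17000],
    "pop2017": [55504, 212628, 25270, 18000],  # outside the 2008-2010 window
})

# A left merge keeps only the counties that appear in the Lyme table.
merged = lyme.merge(census, on="fips", how="left")

# Drop columns that fall outside the date range of interest.
merged = merged.drop(columns=["pop2017"])

# Fill the remaining missing values; the column median is an assumed choice.
income = merged["median_household_income"]
merged["median_household_income"] = income.fillna(income.median())

print(merged)
```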
$\;\;\;\;\;\;$ The deer density data was especially difficult to process, clean, and explore. The original data came in a GIS shapefile that was only accessible through ArcGIS Pro, a subscription-based GIS application. We were able to get help from specialists to access the data. However, the dataset itself was sparse: each region for which deer density was estimated was identified only by an FID code, along with the color code corresponding to that region.
Exploratory Analysis
$\;\;\;\;\;\;$ Exploration of the Lyme disease case counts dataset began by identifying which states had high case counts at the county level. We classified a county as having a high number of cases if it had more than 5000 cases, a moderate number if it had between 1000 and 5000 cases, and a low number if it had fewer than 1000 cases. We then made bar graphs of Lyme disease case counts in five different states: Pennsylvania, Minnesota, California, Utah, and North Carolina. We chose these five states because we wanted to cover all four of the common regions of the U.S. (the Northeast, the South, the Midwest, and the West), a mix of predominantly urban, suburban, and rural states, and states with both high and low Lyme disease case counts at the county level. We realized that the initial series of bar graphs for these states was problematic because it did not compare the case counts in each county with the population of each county. In other words, while our first series of bar graphs helps viewers see the actual number of Lyme disease cases in each county, it does not show how prevalent Lyme disease is within each county or relative to other counties. This is why we created a second series of bar graphs that displays the Lyme disease case counts divided by the population for each county in each of the five states. With this second series of bar graphs, we hope that viewers can better understand how Lyme disease counts compare to the population of each county within a state.
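The thresholds and the population adjustment described above can be written as two small functions. This is a sketch under stated assumptions: the function names and sample counties are hypothetical, counts of exactly 1000 or 5000 are treated as moderate, and the rate is scaled to cases per 100,000 residents for readability, whereas the report simply divides cases by population.

```python
def classify_case_count(cases):
    """Label a county's total case count using the report's thresholds."""
    if cases > 5000:
        return "high"
    if cases >= 1000:
        return "moderate"
    return "low"


def cases_per_100k(cases, population):
    """Cases per 100,000 residents, so counties of different sizes are comparable."""
    return cases / population * 100_000


# Hypothetical counties: (name, total cases, population).
counties = [("Urban A", 6000, 1_200_000), ("Rural B", 900, 45_000)]
for name, cases, population in counties:
    print(name, classify_case_count(cases), round(cases_per_100k(cases, population), 1))
```

In this made-up example the urban county has far more cases but a much lower rate, which is the distinction the second series of bar graphs is meant to show.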
$\;\;\;\;\;\;$ The second code block below contains the exploratory analysis for the deer data. Taking the preliminary Lyme disease and census dataset, we combined all Lyme disease cases and computed the rate of Lyme disease for each state (see Dataset 1). Since our deer density dataset covers the years 2001-2005, we decided to look at Lyme disease cases from 2002-2006 to cover roughly the same years. We ordered the states from west to east, and as you can see from the graphs (see Graphs 1 & 2), the cases and case rates are noticeably higher in the east. So, in order to see if deer density has an impact on Lyme disease, we look at the deer density of the eastern U.S. states, which is exactly what our dataset covers. A simple scatter plot of deer density in each county versus Lyme disease cases in each county would not tell us much, and it would not really work because our deer density column is categorical. So, taking our filtered and cleaned dataset of each county, its deer density as nominal data, and its Lyme disease cases (see Dataset 2), we performed a DBSCAN cluster analysis. To do this, we first created a distance matrix using Gower Dissimilarity (DS), since DS measures how similar each point is to every other point using both the numerical and the categorical data in our dataset. Then we used the DBSCAN algorithm to cluster the data based on this distance matrix. From our scatter plot (see Graph 3), it seems that most cases came from deer density “3”, which corresponds to orange, but the graph does not give us enough information to say explicitly that deer density factors into Lyme disease. So we added the cluster assigned to each county to the dataset, and you can see that we have more than 200 clusters (see Dataset 3). We then calculated the homogeneity, completeness, and V-measure.
They were all low: homogeneity was 0.556, completeness was 0.328, and V-measure was 0.413.
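As a quick consistency check, the V-measure with its default equal weighting is the harmonic mean of homogeneity and completeness, so the third reported score can be recomputed from the first two. The variable names are ours.

```python
homogeneity = 0.556
completeness = 0.328

# With equal weighting, V-measure is the harmonic mean of the two scores.
v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)

print(round(v_measure, 3))  # 0.413
```

The result agrees with the reported V-measure, so the three scores are consistent with one another.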
Analysis Methodology
$\;\;\;\;\;\;$ For the Lyme disease dataset, we first decided to quantify what a high, moderate, or low number of cases looks like in the U.S. at the county level. Our strategy for this was to make bar graphs of Lyme disease cases in Pennsylvania, Minnesota, California, North Carolina, and Utah, since together these states span the highest, middle, and lowest Lyme disease case counts.
$\;\;\;\;\;\;$ For the deer density dataset, we chose a clustering methodology and used the DBSCAN method to analyze the data. We chose this because we expected that, with the categorical deer density data encoded as nominal data, we would get four clusters of dense clouds of points (one for each of the four categories in the deer density data), with points in the same cluster lying in the same high-density region. This would tell us whether counties with green deer density cluster with lower case numbers and, similarly, whether counties with red deer density cluster with higher case numbers. We wanted to use DBSCAN because of its ability to deal with outliers and with potentially complex shapes in our data.
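The clustering step can be illustrated on a toy example. The sketch below assumes scikit-learn (the report does not name the library it used) and runs DBSCAN on a small precomputed dissimilarity matrix for six hypothetical counties, then scores the clusters against made-up deer density codes. The `eps` and `min_samples` values are illustrative, not the project's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy pairwise dissimilarity matrix for six hypothetical counties:
# counties 0-2 are close to each other, as are counties 3-5.
distances = np.array([
    [0.0, 0.1, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.1],
    [0.9, 0.9, 0.9, 0.1, 0.1, 0.0],
])

# metric="precomputed" tells DBSCAN that the input is already a distance matrix.
model = DBSCAN(eps=0.3, min_samples=2, metric="precomputed")
cluster_labels = model.fit_predict(distances)

# Compare the clusters with made-up deer density codes (1 = green, 4 = red).
deer_density_codes = [1, 1, 1, 4, 4, 4]
h, c, v = homogeneity_completeness_v_measure(deer_density_codes, cluster_labels)

print(cluster_labels, h, c, v)
```

In this idealized case the two clusters line up exactly with the two density codes, so all three scores equal 1. Scores well below 1 indicate that the clusters and the deer density categories do not line up.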
$\;\;\;\;\;\;$ For the last dataset, we cleaned the data and combined the Lyme disease and county datasets into one big dataset, as the rows of the two datasets line up county by county. The first thing we did was get rid of redundant columns that did not matter for the data we were looking at (columns outside our date range, averages over years outside our date range, etc.). We are currently planning to take the demographic data and make a scatterplot matrix to see if there is a correlation between specific variables and Lyme cases. To see a clearer picture, we will have to separate our data into states with high and low Lyme disease case counts, as Lyme disease is not prevalent everywhere. We will also use classic regression models for this purpose. After finishing these goals, we may see if we can use the machine learning methods we learned in class to test how well a model predicts the cases in a county based on demographic data. This would likely only use data from states with high case counts, as Lyme disease is prevalent only in particular places.
Results
$\;\;\;\;\;\;$ Our first set of bar graphs of the Lyme disease case counts dataset showed that the condition is highly concentrated in midwestern and northeastern states and it's less concentrated in western and southern states. Our second set of bar graphs showed us that, for the most part, rural counties reported more Lyme disease cases in relation to their population as compared to more urban and suburban counties within a state. This was surprising to us because we made the hypothesis that Lyme disease was spreading in more urban counties in the northeastern United States before we started analyzing our data. It was clear to us after making the second series of bar graphs that this was not always the case.
$\;\;\;\;\;\;$ For the deer density data, we concluded that with over 200 clusters and low homogeneity, completeness, and V-measure scores, we cannot conclusively say that deer density impacts Lyme disease cases. There may be other factors at play, such as other ways that people come into contact with infected ticks. We are also only looking at confirmed cases, and the county-level data may be incomplete, since the deer density regions overlapped some counties and did not account for others. The way the data was collected also leaves room for a large number of possible errors. However, this does not mean that deer have no impact on Lyme disease.
$\;\;\;\;\;\;$ When looking at demographic data, we found some very interesting things in the regression graphs. Similar to the trends we originally noticed in Massachusetts, Lyme case rates generally went up as the white share of the population went up. This was interesting because the percentage of almost every other race had an inverse relationship with Lyme cases. A similar pattern emerged when we compared the percentage of residents under 18 to the percentage over 65 in these counties. Counties with more older people often had more Lyme disease, while counties with more younger people had less. We also found that bachelor's degrees had an inverse relationship with Lyme disease. More specific information on the results can be found below, with graphs that explain the correlations in more depth along with their specific correlation coefficients.
Evaluation
$\;\;\;\;\;\;$ We measured the success of our project by looking at how many different types of analyses of the data we were able to accomplish. As mentioned earlier, the Lyme disease case counts dataset was analyzed through bar graphs and linear regression, the demographic census report was analyzed through linear regression, and the deer density estimates were analyzed through DBSCAN. Overall, we feel that our project was quite successful in the sense that we applied a variety of data analysis techniques and were able to gather results from our analysis to evaluate our project objectives. Even though we did not get fully conclusive answers to our objective questions, we still gathered enough information to form initial ideas, advance the study, and potentially pose deeper questions.
Ethical Considerations
$\;\;\;\;\;\;$ The Lyme disease case counts dataset is being used in an ethical manner because the author of the dataset, Kiersten Kugeler, is credited and the licensing information for the dataset says that it is in the public domain. The deer density dataset comes with a data use agreement stating that use of the data is limited to whoever downloaded it, and that anyone else who wants to use the data must download it on their own and agree to the terms. The agreement also states that any publication or presentation that uses the data should cite the data source and that a copy of the publication should be sent to the source. Lastly, the census dataset is a compilation of many governmental reports, and since these are all government sources, we consider the data to be reasonably ethically sourced. In terms of usage, everything on the website that was sourced from OpenIntro is free to use under a Creative Commons license.
Project Summary
$\;\;\;\;\;\;$ To summarize, we selected and cleaned our datasets consisting of Lyme disease cases, deer population density, and census data to answer our objective questions of which demographics impact Lyme disease and whether the density of white-tailed deer impacts the number of cases. In general, by analyzing our data using regression and clustering, we discovered that case numbers are higher in urban regions; however, when we look at case rates relative to population, the rates are higher in rural regions. We also found that demographic data does have a noticeable relationship with Lyme disease rates, at least when looking at places with higher case counts. In some counties, though, demographic data may be a red herring, since the differences are likely driven by geographic location. However, we cannot say that deer density also has an impact on Lyme disease, as the cluster analysis was not conclusive. There were too many clusters without a good homogeneity score, so we could not cluster the data well enough to say that specific deer densities relate to and affect the number of disease cases. But this does not mean that deer have no impact at all, as the impact could be positive or negative. As stated before, other factors could play into the number of cases, such as the way the disease travels, whether weather, time, and environment impact case rates, or whether the disease spreads in other ways as well. There are a few areas where future research would be useful. The first is the areas with mid-to-low Lyme disease case counts and the effect of demographic data on their case counts. Secondly, we can research more on how Lyme disease is spread through ticks, how ticks travel, and whether deer density impacts tick density. Another question is whether, using machine learning, deer population, and demographic changes, we can project where Lyme disease will have a higher rate in the future.
Finally, we can also research why our demographic data looks the way it does. Some studies have already begun analyzing the difference in Lyme prevalence by race, but age and other factors remain unstudied.
Accessed 3 Apr. 2023.
Kugeler, Kiersten J. "LymeDisease_9211_county." Centers for Disease Control and Prevention, 27 Aug. 2015, data.cdc.gov/dataset/LymeDisease_9211_county/smai-7mz9. Accessed 3 Apr. 2023.
Walters, Brian F.; Woodall, Christopher W.; Russell, Matthew B. (2016). White-tailed deer density estimates across the eastern United States, 2008. Retrieved from the Data Repository for the University of Minnesota, http://dx.doi.org/10.13020/D6G014.
“United States Counties.” Data Sets, https://www.openintro.org/data/index.php?data=county_complete. Accessed 3 Apr. 2023.
Data Visualization Module
From the first set of bar graphs, we can see that raw Lyme disease case counts are higher in urban and suburban counties than in rural counties. For example, in the state of Pennsylvania, Bucks, Chester, and Montgomery counties all report a large number of Lyme disease cases, and all three are suburban counties in the Philadelphia metropolitan area.
From the second set of bar graphs, we can see that Lyme disease is more prevalent in rural counties when comparing case counts to population. For example, in the state of California, Lyme disease cases are overrepresented in the rural northern counties of the state such as Humboldt, Mendocino, and Trinity counties.
Linear Regression Analysis of U.S. Census Report and Lyme disease case counts dataset
Above we defined the dataset for the demographic data. One thing to realize is that many places have 0 cases simply because of their geographic location. Since we wanted to see the effect of demographic data on cases, we decided to look at the top 5% of counties by case count, as this lets us see specifically how demographic data relates to our Lyme disease counts. The name of this new dataset is top10.
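One way to select the counties with the highest case counts is a quantile cutoff, sketched below in pandas. The table and column names are hypothetical, and the report does not state exactly how its cutoff was computed, so this is an illustration rather than the project's method.

```python
import pandas as pd

# Hypothetical county table: 100 counties with case counts 0, 1, ..., 99.
counties = pd.DataFrame({"county_id": range(100), "cases": range(100)})

# The 95th percentile of case counts serves as the cutoff for the top 5%.
cutoff = counties["cases"].quantile(0.95)

# Keep only the counties at or above the cutoff.
top_counties = counties[counties["cases"] >= cutoff]

print(cutoff, len(top_counties))
```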
The first thing we wanted to look at was the data on racial percentages in these counties compared to Lyme cases. We divided our cases by population to get a rate of Lyme disease that isn't skewed by population. After we did this, we created regression plots for the different racial percentages in our dataset. The first thing we found was that, in general and across multiple tests, the Lyme disease rate went up as the percentage of white residents went up. In contrast, for almost every other race, the Lyme disease rate went down as that group's percentage went up. These were mostly weak to moderate correlations, which are not definitive but are still notable.
The next variable we wanted to test was age. Our dataset had data on the percentage of the population under 5, under 18, and over 65. We once again compared this to our limerate variable, as we do for the rest of this part of the report. We found weak to moderate negative correlations between the younger age groups and limerate. On the other hand, we found the opposite for those over the age of 65, with a correlation coefficient of +0.31.
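A correlation coefficient like the one reported above can be computed as in the following sketch, which assumes NumPy. All numbers are made up for illustration and are not taken from the project data; only the name limerate comes from the report.

```python
import numpy as np

# Hypothetical values for five counties (not taken from the project data).
pct_over_65 = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
cases = np.array([5, 9, 20, 22, 40])
population = np.array([10_000, 12_000, 20_000, 15_000, 25_000])

# Case rate per resident, mirroring the report's limerate variable.
limerate = cases / population

# Pearson correlation coefficient between the age share and the case rate.
r = np.corrcoef(pct_over_65, limerate)[0, 1]

print(round(r, 2))
```

A value of r near +1 or -1 indicates a strong linear relationship, while values such as the +0.31 reported above indicate a weak to moderate one.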
Next, we looked at more economic and educational data. Here we didn't find correlations that were as large, but we found that homeownership was moderately positively correlated with limerate. A surprising point was that places with more work travel and more bachelor's degrees showed the opposite pattern, with a weaker negative correlation with limerate.
We also looked at density, median household income, and sales per capita. None of these had strong correlations one way or another.
Finally, we made a scatterplot matrix of the racial percentages that had moderate correlations with limerate and compared them to the raw case count in our top10 dataset. The areas of interest are the top-right and bottom-left quadrants.
DBSCAN analysis on deer density dataset
Dataset 1
This dataset consists of the census population for each state from the 1990, 2000, and 2010 censuses, as well as the number of Lyme disease cases for each state from the years 1992-1996, 1997-2001, 2002-2006, and 2007-2011. Using the case numbers and population for each state during the respective stretch of years, we calculated the case rates per state as well. This data is sorted by state from west to east.
Graph 1
Graph 2
Graphs 1 and 2 show the number of cases and the case rates for each state, ordered from west to east. As you can observe, cases and case rates tend to increase as you move towards the eastern states.
Dataset 2
This dataset represents the filtered deer density and case numbers for each county from the years 2002-2006 with a nominal variable representing the colors for the deer density, with white = 0, green = 1, yellow = 2, orange = 3, and red = 4.
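The color-to-code encoding described above amounts to a simple lookup. In this sketch the function name and the sample colors are hypothetical; only the mapping itself comes from the description.

```python
# Mapping from map color to the nominal code described above.
COLOR_TO_CODE = {"white": 0, "green": 1, "yellow": 2, "orange": 3, "red": 4}


def encode_colors(colors):
    """Convert color names (any capitalization, stray spaces allowed) to their codes."""
    return [COLOR_TO_CODE[color.strip().lower()] for color in colors]


# Hypothetical sample of county colors.
print(encode_colors(["Green", "red", "Orange "]))  # [1, 4, 3]
```

Because the codes are nominal labels rather than measurements, they are best treated as categories in any distance calculation.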
A matrix of the distance between each point is created to use in our DBSCAN model fitting.
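To show what such a matrix contains, here is a hand-written sketch of Gower dissimilarity for one numeric feature (case count) and one categorical feature (deer density code). It is an illustration of the idea with hypothetical data, not the library routine used in the project.

```python
def gower_matrix(cases, density_codes):
    """Pairwise Gower dissimilarity for one numeric and one categorical feature."""
    case_range = max(cases) - min(cases)
    n = len(cases)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Numeric feature: absolute difference scaled by the feature's range.
            numeric = abs(cases[i] - cases[j]) / case_range if case_range else 0.0
            # Categorical feature: 0 when the categories match, 1 otherwise.
            categorical = 0.0 if density_codes[i] == density_codes[j] else 1.0
            # Gower dissimilarity is the average over the features.
            matrix[i][j] = (numeric + categorical) / 2
    return matrix


# Hypothetical counties: case counts and deer density codes.
distances = gower_matrix([0, 50, 100], [1, 1, 4])
for row in distances:
    print(row)
```

Each entry lies between 0 (identical counties) and 1 (maximally different), and the matrix is symmetric with zeros on the diagonal, which is the form a clustering method expects for precomputed distances.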
Graph 3
Dataset 3 (with clusters)