GIS

Communicating data effectively with data visualizations: Part 26 (COVID-19 choropleth)

INTRODUCTION

Power BI is a useful tool that allows you to build dashboards and data visuals. One ability that makes Power BI unique is its mapping feature. Although there are pre-built map functions in Power BI, better and easier applications are available through their AppSource store. This tutorial will go over how to create a choropleth of the most recent COVID-19 pandemic using the incidence rate (dated June 21, 2020) for the United States (US) using Power BI and the Drilldown Choropleth application.

 

DATA

You will need to download two types of data: (1) COVID-19 incidence rate for each county in the United States (You can download the Excel file from this Dropbox link) and (2) Shape files for the United States. Fortunately, there are many sources of shape files for the United States, and I have made one available using the Albers USA projection, which has the state of Alaska and Hawaii in the lower left corner of the map. You can use the link to the map shape file here. Data for the shape files comes from the US Census. [I downloaded the ZIP file and converted the SHP files to TopoJSON using the following conversion tool (Mapshaper.org).] The COVID-19 data contains US county codes called the Federal Information Processing Standard Publication (FIPS), a 5-digit code that identifies counties. Incidence rate is defined as the number of confirmed cases per 100,000 persons.

Albers US projection

POWER BI TUTORIAL

For this tutorial, you will need to have Power BI installed on your desktop to re-create this choropleth.

 

Step 1. Get the data

When you start Power BI, you will see a start screen that allows you to select the data you need. Since the COVID-19 incidence rates for the US data are saved as an Excel file, you will need to select Excel as the data type and then connect to the data.

After clicking “Connect,” the Navigator will open. Select Sheet 1 containing the data. Notice that there are two variables (FIPS1 and FIPS). These are the data that will be used to pair the incidence rate to the US county code. The FIPS code has a “0” in front of some of the county code (e.g., 05001) for a total of 5 digits. However, notice that these are in numeric format so the preceding “0” is missing. To remedy this, we need transform the data. Make sure to select “Transform Data” from the Navigator screen.

After selection “Transform,” you will need to change the FIPS variable from numeric to text.

You will be asked to either “Replace” or “Add New Step;” make sure that you select “Replace.” This will replace the numeric data to text, which is what we want for the FIPS variable. Check the data after you do this; notice that the “0” is now preceding values that are less than 5 digits. (FIPS code is a 5-digit county code.) After the data has been transformed, select “Close & Apply” to finalize the data. This will load the data into Power BI, which you will use to build the choropleth.

 

Step 2. Download and Install “Drilldown Choropleth” app

Power BI allows you to download apps from the AppSource store. Click on the “…” and select “Get more visuals.” This will open the AppSource store where you can type the name of the app you want.

Step 3. Enter data into the Drilldown Choropleth app

After you install the Drilldown Choropleth, an icon will appear in your Visualization panel. Click on this icon (Drilldown Choropleth); this will open up options in the Visualization panel. There are two important fields (“Location” and “Values”). Click and drag the FIPS variable into the “Location” field and the Incidence_Rate variable into the “Values” field.

Next, click on the Paint roll icon to open the Drilldown Choropleth options. Expand the “Shape” option to change the “Projection” to “albersUSA” and then enter the link to the TopoJSON shape file for the United States counties. The TopoJSON link is here: https://raw.githubusercontent.com/mbounthavong/Maps/master/c_03mr20.json

 

 

After entering the projection type and the TopoJSON link, you can make changes to the color of the choropleth. Since this is a choropleth, you will need to enter values for the Minimum, Center, and Maximum colors. This will generate a gradient where the darker colors reflect high incidence rates and the lighter colors reflect low incidence rates of COVID-19.

Finally, you can change the size of the visual and save it. Since there are a lot of data being processed, updating the choropleth will take a few minutes. With a little patience, you will generate a suitable choropleth of the COVID-19 incidence across US counties for June 21, 2020.

CONCLUSIONS

Power BI allows us to create visually spectacular choropleths. Additionally, it allows us to create choropleths that are interactive. You can hover over each shape and see the incidence rate per county. However, there are limitations. Inserting a legend to describe what the colors represent is not easily possible (I was unable to solve this problem). There are roundabout ways to get the legend inserted into this choropleth, but it will require additional software (as far as I know). There is potential for the Drilldown Choropleth app to improve by adding an easy way to include functionalities for legends and labels. Despite this limitation, Power BI allows us to create these choropleths without having to resort to more complex programming in R or Python.

Files used in this tutorial are available here.

REFERENCES

Mapshaper.org (link: https://mapshaper.org/)

US Census Cartographic files (link: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html)

COVID-19 data come from the Johns Hopkins GitHub site (link: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/06-21-2020.csv)

 

Communicating data effectively with data visualizations: Part 21 [Examples of famous (and infamous) data visualizations]

FAMOUS (AND INFAMOUS) DATA VISUALIZATIONS

Modern data visualization has a relatively young history compared to other forms of science (e.g., physics, mathematics, chemistry, biology, etc). However, it’s existence can arguably be more historic. Throughout history, we have examples of data visualizations that helped us understand communicable diseases, wartime operations, and the diffusion of technology. Each of these are important in their own respective fields, but making them comprehensive and intuitive would be nearly impossible without creative data visualizations. This article will review several key historical data visualizations from the cholera outbreak to the dawn of the internet and their impact on our society.

JOHN SNOW AND THE CHOLERA OUTBREAK

In the 19th century, little was known about the transmission of disease. The discovery of the germ theory of disease was still in the horizon, and medical knowledge and understanding of its significance had yet to make its way into public health policy. This was true for London during the cholera outbreak of 1854.[1]

John Snow (1813-1858) was an English obstetrician who is considered one of the founders of epidemiology, the study of health and diseases in populations. At the time, diseases were thought to have spread through the air, popularly known as the miasma theory. Snow was one of the first to reject this theory and believed instead that cholera was due to contaminated water that when drunk caused a viscous cycle of diarrhea and dysentery that ultimately led to death. This belief was further supported when Snow discovered that sewage was dumped directly into the Thames River where the city got their drinking water supply. But to prove his theory, Snow had to chart out the outbreak of cholera in Soho, one the London’s suburbs.

Snow meticulously went to the homes of cholera infected patients and learned where they received their drinking water supply. He mapped his findings onto a grid of the city and observed that clusters of outbreaks occurred around specific points in the suburbs, mainly the water pumps (Figure 1).

Figure 1. John Snow’s map of cholera outbreak in Soho, London, 1854.

Source: John Snow - Published by C.F. Cheffins, Lith, Southhampton Buildings, London, England, 1854 in Snow, John. On the Mode of Communication of Cholera, 2nd Ed, John Churchill, New Burlington Street, London, England, 1855. (This image was originally from en.wikipedia; description page is/was here. Image copied from http://matrix.msu.edu/~johnsnow/images/online_companion/chapter_images/fig12-5.jpg)

Snow also noticed that a large cluster of cholera cases occurred in households near the Broad Street pump (Figure 2). In Figure 2, each bar stack represents the number of cholera cases. In particular, the large number of cholera cases near the Broad Street pump provided further evidence that the drinking water supply was contaminated and was the source of the outbreak.

Figure 2. John Snow’s map of cholera outbreaks near the Broad Street pump.

To prove his point, John Snow had the Broad Street pump handle removed and water delivered from another source, further away from the contaminated Thames River. As he predicted, the incidence of cholera dropped rapidly and the outbreak was mitigated.

This was an early example of using data visualization for real-time surveillance of an outbreak that led to a public health intervention. Clusters of cases within the proximity of the hypothesized contamination source effectively illustrated the benefits of geospatial data visualization of the cholera outcomes in the Soho suburbs of London. Today, we rely on spatial data analysis to monitor the influenza epidemic as well as several other diseases, which will help us to quickly react and contain potential outbreaks.

Napoleon’s Russian campaign of 1812

During the Summer of 1812, Napoleon Bonaparte raised over 422,000 troops and personnel to invade Russia. This was in response to the Russian tsar’s, Alexander I, decision to leave the French-led trade union, which undermined Napoleon’s ideologies for an economically strong centralized Europe.

Charles Joseph Minard (1781-1870) illustrated Napoleon’s doomed campaign of 1812 in a graph that famously shows the decline of the once Grande Armée as it began in the Summer to its fall in the early Winter (Figure 3). The graph tells two stories. The first is the start of the campaign which began in the Summer of 1812 and is displayed by the brown line going from Left to Right. The width of the line represents the size of Napoleon’s army at the beginning of the campaign, which numbered approximately 422,000 strong (troops and personnel). Also displayed is the route the army took to reach Moscow. During the journey, the width of the brown line thins representing the attrition of troops due to desertions and causalities. When Napoleon reached Moscow (represented in the right part of the graph) he only had a small fraction of his original strength (approximately 100,000 troops).

On the return trip, represented by the black line, the width of the line thins considerably and is correlated with the rapid drop in temperature, which is represented by the bottom chart. Desertions, casualties, and the weather reduced Napoleon’s army to approximately 10,000 troops and personnel (less than 3 percent of his original strength) by the time he reached the Neman River.

Figure 3. Charles J. Minard’s graph depicted Napoleon’s Grande Armée ill-fated Russian Campaign of 1812.

Source: Charles Joseph Minard's famous graph showing the decreasing size of the Grande Armée as it marches to Moscow (brown line, from left to right) and back (black line, from right to left) with the size of the army equal to the width of the line. Temperature is plotted on the lower graph for the return journey (multiply Réaumur temperatures by 1¼ to get Celsius, e.g. −30 °R = −37.5 °C). Published November 20, 1869. (This image was originally from en.wikipedia; description page is/was here. Image copied from https://en.wikipedia.org/wiki/French_invasion_of_Russia#/media/File:Minard.png)

Minard’s graph shows many data elements highlighting the potential for multiple dimensions incorporated onto a two-dimensional canvas. The lines (both brown and black) denote the route of the army and its strength. At the very bottom of the graph, the temperature of the return journey dropped to below freezing temperatures highlighting the misery of the French troops during the long retreat to France (Figure 4). The creative use of space allowed Minard to include many data dimensions to tell the horribly tragic story of Napoleon’s disastrous Russian campaign. To date, Minard’s graphic is a reminder of the devastating defeat of Napoleon’s ambitions in Europe and the effective use of data visualizations to tell a compelling story.[2]

Figure 4. Temperatures on the return journey (Right to Left).

CARNA BOTNET MAP

In what is now called the Internet Census of 2012, an anonymous hacker produced one of the most important and invaluable data visualization of the diffusion of internet traffic across the globe.[3] Using a botnet and taking advantage of vulnerabilities in network systems, this anonymous hacker was able to penetrate the securities of these networks and then ping these IP addresses to yield a census of active internet networks across the world. The botnet was called Carna, named after the Roman goddess of the door hinge (but she is also known as the goddess of the body). The Carna botnet captured over 1.3 billion IP addresses in the world.

The Carna botnet map is an animated Graphic Interchange Format (GIF) file that provides a 24-hour cycle of internet use around the globe (Figure 5). It was first published sometime in June to October 2012 by the anonymous hacker who wanted to illustrate internet use around the world with all the data that was available. To this day, no one knows the identity of the hacker.

Figure 5. 24-hour world map of IP addresses observed using IP ping requests.

Source: World map of 24 hour relative average utilization of IPv4 addresses observed using ICMP ping requests. Carna Botnet, * Internet Census 2012: Port scanning /0 using insecure embedded devices, Carna Botnet, June - October 2012. 16 March 2013.

The author of this animated GIF uses colors and contrast ratio effectively to deliver a powerful narrative of the daily cycle of internet use. The warm colors represent internet usage during the day and the cool colors represent internet usage after sunset. The nightly cycle moves from Right to Left giving the impression that the world is rotating from being asleep to being awake. More importantly, the image of the world provides the audience with a reference that is recognizable and easy to understand. The data that were used to generate this animated GIF continue to be used by researchers to study their implications on internet security and ethics.[4,5]

It is highly recommended that you download and view the GIF on your own to appreciate the animation.

CONCLUSIONS

Data visualization is an effective tool to tell complicated stories; sometimes, it’s the only way. Historically, we have been doing this without the aid of personal computers and visual software. In most cases, data visualization was something that was done by hand and carefully illustrated like a piece of art. In these examples, stories from the cholera outbreak, failed military ambition, and an illegal comprehensive internet census have provided us with a better understanding of how our world operates and the impact of these data on our society.

REFERENCES

  1. Johnson S. The Ghost Map: The Story of London’s Most Terrifying Epidemic—And How It Changed Science, Cities and the Modern World. New York, NY, USA: Riverhead Books; 2006.

  2. Joyce H. Minard and Napoleon’s march on Moscow. Significance. 2008;5(3):133-134. doi:10.1111/j.1740-9713.2008.00311.x

  3. Internet Census 2012. http://census2012.sourceforge.net/paper.html. Accessed December 12, 2019.

  4. Krenc T, Hohlfeld O, Feldmann A. An Internet Census Taken by an Illegal Botnet: A Qualitative Assessment of Published Measurements. SIGCOMM Comput Commun Rev. 2014;44(3):103–111. doi:10.1145/2656877.2656893

  5. Dittrich D, Carpenter K, Karir M. The Internet Census 2012 Dataset: An Ethical Analysis. IEEE Technology and Society Magazine. June 2015:40-46. doi:10.1109/MTS.2015.2425592

Developing choropleths using the United States Veterans Integrated System Network (VISN) shapefiles

BACKGROUND

When I want to present VISN-level data, I consider using choropleths because they are visually appealing and provide a good reference to other VISNs. Choropleths are maps that uses polygons or shapes that are shaded according to a metric such that the color indicates the intensity of that metric. For instance, if you wanted to compare population density across different states, you can use a choropleth to illustrate this difference.

An example of a choropleth comes from the Centers for Disease Control and Prevention that illustrates the prevalence of obesity by state. The legend tells us the prevalence of obesity at each state and the colors denote the level of the prevalence. The cranberry color denotes an obesity prevalence of 35% or greater whereas a lighter green color with dots denotes a prevalence of less than 20%. This choropleth provides a quick visual guide on the prevalence of obesity across the United States (U.S.).

Figure 1.png

Motivating example

In past reports, I have generated a choropleth using VISN-level data. Unlike the U.S. shapefile (map files with coordinates; normally with the *.shp extension), the VISN shapefile is specific to the VA and doesn’t not follow the borders of the states used in typical U.S. shapefiles.

In this example, I will provide a step by step guide on how you can generate your own VISN-level choropleth for use in reports and presentations. The files for this tutorial are available on following Dropbox link.

 

TUTORIAL

Step 1: Download QGIS

Download QGIS, which is a free Geographic Information Systems (GIS) software for both the Windows and Mac operating systems. Watch the following video for a step-by-step guide on downloading and using the program. The program is simple to use and does not involve any coding. After you install the software, proceed to the next step. (Contact your local IT support to have this installed on a government-owned system.)

 

Step 2: Download the VISN shapefiles

The VA provides shapefiles online at the following link. Download the file titled FY2017_Q4_VISN.zip. This file will contain all the necessary files that you will need to build your choropleth.

 

Step 3: Download VISN-level data on total population

We will need VISN-level data to join with the VISN-level shapefile in QGIS. You can download the data from the following VA public site. Go to the Population Tables and download the “All Veterans Integrated Service Networks” Excel file. It will contain data on the total population at each VISN.

With QGIS, you will need to have two types of files for the data. I recommend using a text editor (not the Windows native notepad) such as Notepad++ or Brackets. In the text editor, open the file with the data and save it as a *.csv. The reason we do this is to make sure that the data is in text format. There are certain values that you want to ensure include the “0” in front of the other numbers (e.g., “01,” “02,” “03,” etc). If you don’t include the “0” you will not be able to join the data to the shapefiles.

After you save this as a *.csv (please include the extension onto the title), then you can open a new document on Notepad++ and enter the data column format. For instance, if column 1 is in text format, then type “String” for the first column. If the second data column is in numeric format, type “Integer.” We have a total of seven columns; therefore, we need to have seven data formats. See the example below.

Step 4: Open QGIS and add the VISN shapefile layer

Click on the Layer tab > Add Layer > Add Vector Layer and browse for the VISN shapefile.

Click on Open and make sure to click Add to add the VISN shapefile onto your QGIS software workspace. You should see the following image appear.

Notice how the polygons are in the form of the VISNs instead of the states. This is an important difference between what you see with the U.S. shapefiles and the VA shapefiles.

 

Step 5: Add the VISN-level population data

Include the VISN-level population data by downloading it from the VA public site. However, you can also use the Dropbox link that I host with the files already formatted for QGIS here. For this exercise, it would be easier to use the files I provide in the Dropbox link since the formatting may be challenging to implement. For more discussion about the proper formatting, please watch the following video.

You add the VISN-level population data by dragging and dropping it into the Layers panel. Use the file titled “visn_population_2018.csv” and make sure to drop it into the Layers panel. QGIS will automatically recognize the data types because the “visn_population_2018.csvt” file is in the same folder as the “visn_population_2018.csv.”

Step 6: Join the data to the shapefile

Double-click the VISN shapefile in the Layers panel; this will open a new window called the Layers Properties. Click on the JOIN icon and select the data you want to join to the VISN shapefile. Make sure to select VISN for the Join and Target field. This will use the VISN number to join the data to the shapefile. After you select OK, make sure to click on Apply.

Step 7: Adding classes and color

In the Layer Properties window, select the Symbology icon, which will open the menu to add classes and change the color of the different classes. Above the column field, select the Graduated level. In the Column field, select the visn_population_2018_Total, which is the total population of the VISN. Then select Quantile in the Mode field. Change the color ramp field to a blue hue. Click apply and you should immediately see the VISN shapefile file change colors in the workspace.

Your project workspace should look like the following map.

Step 8: Adding labels

Click on the Label icon and select the Single Labels option. Select the Labels variable and then click apply. This may take a while since QGIS has to identify the polygon’s location and insert the labels. The labeling may take about 3-5 minutes because the VISN shapefiles have layers and polygons. I recommend saving this step for when you export the final image generated using the composure function of QGIS to save yourself time.

After the labeling is done, you should see the VISN labels for each polygon.

 

Step 9: Use the composer to finalize your choropleth

The composer is QGIS’s workspace that allows you to customize the choropleth. Select the composer and name it “VISN population” and then select the sections you want to insert into the composer using the Adds New Map to the Layout icon. Once everything is finalized, you can export this as a *.png or *.pdf file. (At this time, you may turn on the labels if you waited to add these at the very end.)

This is what the choropleth looks like after we finish composing it.

Step 10: Add a coordinate reference system (CRS)

Right now, the map is not an accurate portrayal of the United States in relation to the surface of the Earth. It should be more round at the top due to the distance from the North Pole and the fact that the Earth is a sphere. To ensure that we are accurately portraying the U.S., we need to install the appropriate coordinate reference system (CRS). To do this we need to first click on the Properties of the project and select CRS. We add the CRS from the server using the Datum Transformations window. We change both the Source and Destination CRSs and use the USA_Contiguous_Albers_Equal_Area_Conic CRS (ID = EPSG:102003).

Once the CRS is installed into our CRS database, we can select it to change the shape of the map to correctly conform to the shape of the Earth.

This is what the choropleth looks like in the project workspace.

After we add the labels and compose the final elements, the choropleth looks like the following.

CONCLUSIONS

Using choropleths can highlight important differences across VISNs that would be lost in a table or difficult to present in an alternative chart. Based on the population levels, VISN 22, 17, 10 and 8 have large populations of veterans receiving care at the VA. Areas where there is low prevalence of veterans are in VISN 1, 5, 9, and 15. One thing to consider is that we did not normalize the data based on a single denominator. You can play around with how you want to do that by using the U.S. census, which can be found here. As an added exercise, see if you can create something similar using the U.S. shapefiles, which are located here. Additionally, you can use multiple choropleths (small multiples) to show changes across time or another dimension. Choropleths are excellent visuals that can contribute to a narrative; using the VISN shapefiles will allow you to generate visuals that can enhance a report or presentation.

 

REFERENCES

I used the following references to develop this tutorial.

https://www.youtube.com/watch?v=aLmMovuydqI&t=387s

https://www.youtube.com/watch?v=rG6UphZGmg4&t=615s

https://www.youtube.com/watch?v=LNJj3g6SRqU

 

Download QGIS here:

https://www.qgis.org/en/site/forusers/download

 

VA population data:

https://www.va.gov/vetdata/veteran_population.asp

 

VISN shapefiles:

https://catalog.data.gov/dataset/veterans-integrated-services-networks-visn-markets-submarkets-sectors-and-counties-by-geog