time series

Using Stata’s bysort command for panel data in time series analysis

BACKGROUND

Sorting information in panel data is crucial for time series analysis. For example, sorting by the time for time series analysis requires you to use the sort or bysort command to ensure that the panel is ordered correctly. However, when it comes to panel data where you may have to distinguish a patient located at two different sites or a patient with multiple events (e.g., deaths), it’s important to organize the data properly.

You can download the sample data and Stata code at the following links:

Data

Code

 

MOTIVATING EXAMPLE

In this example, we have a data set with time (months) in the column and patients in the rows (this is called a wide format data set). For each month, there are different numbers of observations. For instance, in Month 1, there were 5 observations. But in Month 7 there were only three.

The highlighted boxes indicate a patient was observed at two different sites. There are two ways to approach this: (1) remove the patient from Site B or (2) keep the patient by distinguishing it at each sight. Removing the patient will result in a loss of information for Site B, but keeping the patient complicates the panel data when we convert from wide to long format.

Figure 1.png

Converting this from wide to long format would result in the following panel data. Review each patient, in particular, the months of observations reported for the months. Notice that not all patients have observations for all the months (Months 1 to 7). Some patients have observations for scattered months (e.g., Patient 3). Of note is Patient 2 who has observations at Sites A and B. Since we opted to keep Patient 2 data for Sites A and B, we need to distinguish a method to ensure that the panel data is ordered correctly. Interestingly, Patient 8 has an observed event  (Death) three times at Months 5, 6, and 7. Since a patient should experience death only once, this may be a coding error and should be removed. Using the Stata sort and bysort command will allow us to fix this problem.

Figure 2.png

The bysort command has the following syntax:

bysort varlist1 (varlist2): stata_cmd

Stata orders the data according to varlist1 and varlist2, but the stata_cmd only acts upon the values in varlist1. This is a handy way to make sure that your ordering involves multiple variables, but Stata will only perform the command on the first set of variables.

 

REMOVE REPEATED DEATHS FROM PATIENT 8

First, we want to make sure we eliminate the repeated deaths from Patient 8. We can do this using the bysort command and summing the values of Death. Since Death == 1, we can sum up the total Deaths a patient experiences and drop those values that are greater than 1—because a patient can only die once.

***** Identify patients with repated death events. 
bysort id site (month death): gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1 

The alternative methods use the sort command:

* Alternative Method 1:
by id site (month death), sort: gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1

* Alternative Method 2:
sort id site (month death)
by id site (month death): gen byte repeat_deaths = sum(death ==1)
drop if repeat_deaths > 1
Figure 3.png

Now we have a data set without the unnecessary death values for Patient 8. Therefore, Patient 8 will not be counted in months 6 and 7 because they are no longer contributing to the denominator.

 

COUNT THE NUMBER OF DEATHS PER MONTH

Suppose we want to perform a single group time series analysis. We would want to sum up the number of deaths across the months. We can do this using the bysort command.

First, we have to think about how we want to count death. Since Death == 1, we want to add up the number of Death for each month. Initially, we were worried that Death would be counted two more times for Patient 8, but we solved this problem by removing these events from Patient 8.

Figure 4.png

The following command will yield the above results in a long format.

bysort month: egen byte total_deaths = total(death)

We use the egen command because we are using a more complex function. Detailers on when to use gen versus the egen commands are located at this site.

 

DETERMINING THE DENOMINATOR—COUNTING THE NUMBER OF PATIENTS CONTRIBUTION INFORMATION

Next, we want to determine that number of patient observations that are contributed to each month. To do this, we can use the bysort command again.

***** Determine the denominator -- using bysort and counter variable
gen counter = 1
bysort month: egen byte total_obs = total(counter)

This should yield the following results:

Figure 5.png

CHANGING FROM PATIENT-LEVEL DATA TO SINGLE-GROUP DATA

Currently, the data is set up using the patient-level. We want to change this to the single-group level or the aggregate monthly level. To do this, we have to eliminate the repeated month measurements for our total deaths (numerator) and total observations (denominator).

***** Drop duplicate months
bysort month: gen dup = cond(_N==1, 0, _n)
drop if dup > 1

We can visualize this by plotting two separated lines connected at the values for each month.

****** Plot the total number of deaths and total number of observations
graph twoway (connected total_deaths month, lcol(navy)) ///
             (connected total_obs month, lcol(cranberry) ytitle("Number") ///
	      xtitle("Months") ylab(, nogrid) graphregion(color(white)))
Figure 6.png

We can take this a step further and calculate the prevalence.

***** Estimate the prevalence (per 100 population) and plot
gen prev = (total_deaths / total_obs ) * 100	

graph twoway connected prev month, ytitle("Prevalence of Death" "per 100 population") ///
	     xtitle("Months") ylab(, nogrid) graphregion(color(white))
Figure 7.png

CONCLUSIONS

Using the bysort command can help us fix a variety of data issues with time series analysis. In this example, we have patient-level data that contained deaths for one patient and a patient who was observed at different sites. Using the bysort command to distinguish between sites allowed us to properly identify the patient as unique to the site. Additionally, we used the bysort to identify the patient with multiple deaths and eliminated these values from the aggregate monthly values. Then we finalized out single-group data set by summing the total deaths and observations per month and removing the duplicates.

You can download the Stata code from my Github site.

 

REFERENCES

I used the following references to write this blog.

Stata commands: bysort:

https://stats.idre.ucla.edu/stata/faq/can-i-do-by-and-sort-in-one-command/

 

Stata commands: gen versus egen:

https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesmodifying-data/

Counting and Data Manipulation for an ITSA

BACKGROUND

In time series analysis, we are interested on the impact of some exposure over a time period. Exposure can be coded as event==1. If this is time-varying, then the event can occur at any time across a time period. Time series analysis requires us to identify the time when the event first occurred. In most cases this is also considered the post period. In this example, we will label the exposure of interest as event.

Longitudinal data can come in either a wide or long format. However, it is easier to perform longitudinal data analysis in the long format. This assumes that you declare either the xtset or tsset as a panel or time-series data set, respectively.

MOTIVATING EXAMPLE

Let’s assume that we have two subjects (A and B), who can experience an event at any time between some time variable 1 and 5, time(1:5). This is a longitudinal data set in the long format with id as the unique subject-level identifier, the exposure variable of interest event as the exposure, and time as an arbitrary time variable ranging from 1 to 5. The event for subject A occurs at time==3.

Screen Shot 2018-02-18 at 3.13.23 PM.png

Suppose you want to create a variable that counts the number of times the subject has the event. We will call this variable duration.

Screen Shot 2018-02-18 at 3.14.35 PM.png

The following Stata code will generate the duration variable.

by id (time), sort: gen byte duration = sum(event==1)

Sorting by the id and then time will nest the time sequence for each id. The sum() will add all event that is coded as 1.

It’s critical that you put time in parentheses (); otherwise, you can generate incorrect values. For instance, if you make the mistake of typing the Stata code as follows, you will generate a dataset which doesn’t provide the cumulative duration of having the event. Notice how the duration variable only has 1 instead of 1, 2, and 3.

by id time, sort: gen byte duration = sum(event==1)
Screen Shot 2018-02-18 at 3.17.26 PM.png

Similarly, if you use the following code, you will generate the incorrect values. The sum(event)==1 syntax should be sum(event==1). However, this will “flag” the time when the event first occurred, which may be useful in some cases. 

by id (time), sort: gen byte duration = sum(event)==1
Screen Shot 2018-02-18 at 3.18.48 PM.png

Let’s take our example further and generate a variable column that takes into consideration the period before the subject experiences an event. Suppose subject A experiences an event at time==3, but we want to center this as 0 and previous months as -1, -2, and so on. We need to first identify the time when the event occurs and populate that as a new variable, which we will call firstevent. We can use the following State code to generate firstevent based on the condition that the event==1 and the variable it occurs is time==3.

egen firstevent = min(cond(event == 1, time, .)), by(id)
Screen Shot 2018-02-18 at 3.19.41 PM.png

There will be missing values since not all subjects experience the event. To populate the missing values for the subjects with no events (event==0), we need to replace firstevent by identifying the max time of the entire study period using the summary command. Once we have the max time of the study period, we add 1 to this and replace the missing values from the firstevent variable.

replace firstevent = max(time) if firstevent == .
summ time
global maxtime = r(max)

replace firstevent = $maxtime + 1 if firstevent == .
Screen Shot 2018-02-18 at 3.20.31 PM.png

We can subtract the time from the first event to generate a new variable (its) that will capture the negative time before the event occurs and the positive time after the event occurs, centered on when event==1.

by id (time), sort: gen byte its = _n – firstevent
Screen Shot 2018-02-18 at 3.21.29 PM.png

The new variable its is short for interrupted time series analysis. An investigator can use the its variable to plan any interrupted times series analysis without having to go through the ordeal of generating this variable using other software.

Here is a summary of the entire Stata code, which you can also find on my Github page:

***** Declare XTSET panel dataset.
* Variable list: id event time 
* id        =   subject identifier
* event     =   exposure of interest
* time      =   time interval

**** Create the duration variable to capture time after event.
by id (time), sort: gen byte duration = sum(event==1)

**** Create a variable for the time before the event.
egen firstevent = min(cond(event == 1, time, .)), by(id)

**** Identify the maxtime.
summ time
global maxtime = r(max)

**** Replace missing data with the maxtime + 1.
replace firstevent = $maxtime + 1 if firstevent == .

**** Create its to capture time before and after event. 
by id (time), sort: gen byte its = _n - firstevent

ACKNOWLEDGEMENTS

I used several online references to develop this tutorial for Stata. Nicholas J. Cox has some excellent tutorials that was influential in developing this piece.

Cox N. First and last occurrences in panel data. From https://www.stata.com/support/faqs/data-management/first-and-last-occurrences/

The Statlist forum was also helpful; in particular, the following discussion threads.

https://www.statalist.org/forums/forum/general-stata-discussion/general/965910-how-generate-variable-that-indicates-current-and-prior-event-occurrence-for-id-in-panel-data

https://www.statalist.org/forums/forum/general-stata-discussion/general/1297707-creating-duration-variable-for-panel-data

https://www.stata.com/statalist/archive/2010-12/msg00193.html

https://www.stata.com/statalist/archive/2012-09/msg00286.html

The UCLA Institute for Digital Research and Education has a tutorial on using _N and _n to count in Stata.

Counting from _N to _N. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/stata/seminars/notes/counting-from-_n-to-_n/

Communicating data effectively with data visualization - Part 4 (Time series)

Introduction

An important element of data visualization is to tell a story. To do that, we should have the end in mind. Namely, what is it you want to share with your audience?

Often, time series data can do this using some clever data visualization. Typically, this is presented on an XY plane where time is presented on the X-axis and the value of interest is presented on the Y-axis. We will not go into time series analysis, which involves a lot more than just plotting the data. However, we will go over the proper way in which to present your time series data visually.

 

Motivating example

We will use data from the National Health Expenditure Account (NHEA), which contains historical data on health expenditures in the U.S. from 1960 to 2016. The costs presented by the NHEA are properly adjusted for inflation. You can find the data at this link: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/NationalHealthAccountsHistorical.html

 

Time series data

To visualize time series data, it is best to have increments of time that are equally spaced in the X-axis. We use Excel to illustrate these examples. Figure 1 illustrates the annual interval of national health expenditures ($ billions) in the United States from 1960 to 2016. The outcome (national health expenditure) is on the Y-axis and time (year) is on the X-axis. Notice that each time increment is one year and evenly spaced across the X-axis. This allows the eyes to intuitively see the changes across time in the U.S. national health expenditure.

Figure 1. National Healthcare Expenditure in the United States, 1960 to 2016.

Figure 1.png

What if the story is to see highlight health expenditures in the last decade? How would we do this?

First, we can use the same data and restrict the X-axis to 2007 to 2016 as in Figure 2.

 

Figure 2. National Health Expenditures in the United States, 2007 to 2016.

Screen Shot 2018-01-24 at 6.02.04 PM.png

Figure 2 doesn’t seem interesting. There is an increase in health expenditures from 2007 to 2016, but this doesn’t seem significant. However, there is a 45% increase from 2007 to 2016 in health expenditures ($2,295 billion to $3,337 billion). Figure 2 doesn’t convey this increase because there is a lot of white space between the lowest health expenditure value in 2007 and $0.

One way to illustrate the large increase in health expenditure is to truncate the Y-axis. In previous articles, we stressed that truncated axis can distort and trick the mind into seeing large differences where they don’t exist. However, this same technique can be used to make sure that differences that exist are not misinterpreted as not visually significant. According to Tufte:

In general, in a time-series, use a baseline that shows the data not the zero point. If the zero point reasonably occurs in plotting the data, fine. But don't spend a lot of empty vertical space trying to reach down to the zero point at the cost of hiding what is going on in the data line itself.[1]

In other words, time series data should focus on the area of the timeline that is interesting. The graphic should eliminate the white space and show the data horizontally for time series visuals.

Eliminating the white space and identifying the baseline value as $2,200 billion instead of $0 changes the figure as illustrated in Figure 3.

 

Figure 3. National Health Expenditures in the United States, 2007 to 2016.

Figure 3.png

Figure 3 illustrates the increase in national expenditure in the last decade better than Figure 2 and maintains the narrative that there was a visually significant increase.

Putting these concepts together (along with Tufte’s other principles), we can generate a similar figure using R (Figure 4). The R code is listed below.

# Plot trend - without truncation
library(lattice)
  
xyplot(y~x, 
       xlab=list(label="Year", cex=1.25),
       ylab=list(label="National Health Expenditures ($ Billions)", cex=1.25),
       main=list(lable="National Health Expenditure (2007 to 2016)", cex=2),
       par.settings = list(axis.line = list(col="transparent")),
       panel = function(x, y,...) { 
          panel.xyplot(x, y, col="darkblue", pch=16, cex=2, type="o")
          panel.rug(x, y, col=1, x.units = rep("snpc", 2), y.units = rep("snpc", 2), ...)})

Figure 4. National Health Expenditures in the United States, 2007 to 2016.

Figure 4 (darkblue).png

Figure 4 incorporates the use of Tufte’s principles on data-ink ratio and truncation on the y-axis to highlight the change in National Health Expenditure between 2006 go 2017.

 

Conclusions

With time series data, truncating the Y-axis to eliminate white space and show the data horizontally is appropriate when telling the story of what’s happening across time. Using zero as the baseline for the Y-axis is appropriate if it is reasonable. However, do not compromise the story by having the Y-axis extend all the way to zero if it doesn’t tell the story properly. Knowing when and how to truncate the Y-axis will help you explain to your audience the significance of a change across a specific period in time.

 

References

1. Edward Tufte forum: baseline for amount scale [Internet]. [cited 2018 Jan 14];Available from: https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00003q