How many drug shortages are there?

It was recently brought to my attention that there is a discrepancy between the number of reported shortages on the drugshortagescanada.ca website, and my own reckoning on display on the drug shortages dashboard I maintain.

This blog post is a play-by-play as I dig through data to figure out why.

Drug Shortages Canada (DSC) reports the following on their summary page today (April 18, 2020):

Shortage reports          10039
Actual shortage: i         2029 (20%)
Anticipated shortage:        67 (1%)
Avoided shortage:           259 (3%)
Resolved:                  7684 (77%)


Discontinuation Reports    1826
To be discontinued:         117 (6%)
Discontinued:              1684 (92%)
Reversed:                    25 (1%)

Late reports               1774
Overdue reports             266

Today, on my dashboard I report a total 1816 active shortages. Let’s see where this difference of 213 shortages comes from (their 2029 vs my 1816).

First, let’s be sure we can reproduce the DSC figures. First, get set up and grab the entire DSC database:

library(tidyverse)
library(drugshortagesr)     # devtools::install_github('pipitone/drugshortagesr')
dsc = dsc_search()
Code language: PHP (php)

Then, we can count the reports by their “status” field:

dsc %>% count(status)
# A tibble: 7 x 2
  status                   n
  <chr>                <int>
1 active_confirmed      2029
2 anticipated_shortage    67
3 avoided_shortage       259
4 discontinued          1684
5 resolved              7684
6 reversed                25
7 to_be_discontinued     117
Code language: HTML, XML (xml)

Great, this matches perfectly with what is reported on the DSC.

When I calculate the number of active shortages I also require that the status flag is set to “active_confirmed” but I add an extra restriction that the current date falls between the “actual_start_date” and the end date (either the “actual_end_date” or “estimated_end_date” if provided).

FAR_FUTURE = ymd("2106-02-07")
dsc %>% filter(status == "active_confirmed", 
    now() %within% interval(
      actual_start_date, 
      coalesce(as_date(actual_end_date), as_date(estimated_end_date), FAR_FUTURE))) %>%
    count() %>% pull()Code language: JavaScript (javascript)
[1] 1816Code language: JSON / JSON with Comments (json)

The extra date restriction here is vestigial code from a time when I calculated the number of shortages expected in the current week and so needed to do some fancy date overlap checks. But, it shouldn’t matter right? An “active_confirmed” shortage should have as start date that is on or before today’s date and an (estimated) end date in the future if any is supplied.

But okay, perhaps the issue is that reported end dates shouldn’t be trusted until the shortage is over, so let’s ignore them:

dsc %>% filter(status == "active_confirmed", 
    now() >= actual_start_date) %>% 
    count() %>% pull()Code language: JavaScript (javascript)
[1] 1938Code language: JSON / JSON with Comments (json)

Okay! We found 122 shortages if we ignore end dates that happen in the past. We’re still off by 91 shortages which clearly means there’s an issue with the start dates also.

Surely we can’t have “active confirmed” shortages with start dates in the future?

dsc %>% filter(status == "active_confirmed", 
    now() < actual_start_date) %>% 
    count() %>% pull()Code language: HTML, XML (xml)
[1] 0    Code language: JSON / JSON with Comments (json)

Phew. But wait… if there are no shortages with future dates, how come we’re missing shortages when we count all that have start dates that are today or in the past?

As a sanity check, let’s break down the “active confirmed” shortages by the year of their reported start date:

dsc %>% filter(status == "active_confirmed") %>%
    count(year(actual_start_date))Code language: JavaScript (javascript)
# A tibble: 14 x 2
   `year(actual_start_date)`     n
                       <dbl> <int>
 1                      2006     1
 2                      2008     1
 3                      2010     1
 4                      2011     4
 5                      2012    11
 6                      2013    10
 7                      2014    63
 8                      2015    27
 9                      2016   102
10                      2017   324
11                      2018   362
12                      2019   509
13                      2020   523
14                        NA    91Code language: HTML, XML (xml)

Oof. First things first, we have found our missing 91 shortages: they are missing start dates and so I end up filtering them out.

But look at the other counts… This says there are shortages reportedly ongoing since more than a decade ago. I don’t believe it, but who am I.

So just to be absolutely sure we’ve accounted correctly for the missing shortages (all 213 of them), let’s count all of the shortages that have empty start dates or end dates (actual or if not available, estimated) in the past:

dsc %>% filter(status == "active_confirmed") %>%
  mutate(end_date = coalesce(as_date(actual_end_date), as_date(estimated_end_date), FAR_FUTURE)) %>%
  filter(is.na(actual_start_date) | end_date < now()) %>%
  count() %>% pull()Code language: HTML, XML (xml)
[1] 213Code language: JSON / JSON with Comments (json)

tl;dr: My count of active shortages differs from what shows up on https://drugshortagescanada.ca because I’ve filtered out reports that have empty start dates and end dates in the past.

Also, in the process of investigating this we found some ancient “active” shortages which tells us we probably found some long-forgotten shortage reports that likely need to be updated and closed.

In short, the DSC shortage reports are messy real world data with gaps and inconsistencies. There is certainly more that the DSC could be doing to clean up their dataset.

But, what should I do? I can alter my code so that it counts all “active_confirmed” records regardless of dates, but that only works for a single point in time… i.e. now. When I want to compute historical totals, then I need to use the dates provided (although I can ignore the estimated dates), and so my totals are going to be different than what is reported by the DSC. So be it, I guess.