It was recently brought to my attention that there is a discrepancy between the number of reported shortages on the drugshortagescanada.ca website, and my own reckoning on display on the drug shortages dashboard I maintain.
This blog post is a play-by-play as I dig through data to figure out why.
Drug Shortages Canada (DSC) reports the following on their summary page today (April 18, 2020):
Shortage reports 10039
Actual shortage: i 2029 (20%)
Anticipated shortage: 67 (1%)
Avoided shortage: 259 (3%)
Resolved: 7684 (77%)
Discontinuation Reports 1826
To be discontinued: 117 (6%)
Discontinued: 1684 (92%)
Reversed: 25 (1%)
Late reports 1774
Overdue reports 266
Today, on my dashboard I report a total 1816 active shortages. Let’s see where this difference of 213 shortages comes from (their 2029 vs my 1816).
First, let’s be sure we can reproduce the DSC figures. First, get set up and grab the entire DSC database:
library(tidyverse)
library(drugshortagesr) # devtools::install_github('pipitone/drugshortagesr')
dsc = dsc_search()
Code language: PHP (php)
Then, we can count the reports by their “status” field:
dsc %>% count(status)
# A tibble: 7 x 2
status n
<chr> <int>
1 active_confirmed 2029
2 anticipated_shortage 67
3 avoided_shortage 259
4 discontinued 1684
5 resolved 7684
6 reversed 25
7 to_be_discontinued 117
Code language: HTML, XML (xml)
Great, this matches perfectly with what is reported on the DSC.
When I calculate the number of active shortages I also require that the status flag is set to “active_confirmed” but I add an extra restriction that the current date falls between the “actual_start_date” and the end date (either the “actual_end_date” or “estimated_end_date” if provided).
FAR_FUTURE = ymd("2106-02-07")
dsc %>% filter(status == "active_confirmed",
now() %within% interval(
actual_start_date,
coalesce(as_date(actual_end_date), as_date(estimated_end_date), FAR_FUTURE))) %>%
count() %>% pull()
Code language: JavaScript (javascript)
[1] 1816
Code language: JSON / JSON with Comments (json)
The extra date restriction here is vestigial code from a time when I calculated the number of shortages expected in the current week and so needed to do some fancy date overlap checks. But, it shouldn’t matter right? An “active_confirmed” shortage should have as start date that is on or before today’s date and an (estimated) end date in the future if any is supplied.
But okay, perhaps the issue is that reported end dates shouldn’t be trusted until the shortage is over, so let’s ignore them:
dsc %>% filter(status == "active_confirmed",
now() >= actual_start_date) %>%
count() %>% pull()
Code language: JavaScript (javascript)
[1] 1938
Code language: JSON / JSON with Comments (json)
Okay! We found 122 shortages if we ignore end dates that happen in the past. We’re still off by 91 shortages which clearly means there’s an issue with the start dates also.
Surely we can’t have “active confirmed” shortages with start dates in the future?
dsc %>% filter(status == "active_confirmed",
now() < actual_start_date) %>%
count() %>% pull()
Code language: HTML, XML (xml)
[1] 0
Code language: JSON / JSON with Comments (json)
Phew. But wait… if there are no shortages with future dates, how come we’re missing shortages when we count all that have start dates that are today or in the past?
As a sanity check, let’s break down the “active confirmed” shortages by the year of their reported start date:
dsc %>% filter(status == "active_confirmed") %>%
count(year(actual_start_date))
Code language: JavaScript (javascript)
# A tibble: 14 x 2
`year(actual_start_date)` n
<dbl> <int>
1 2006 1
2 2008 1
3 2010 1
4 2011 4
5 2012 11
6 2013 10
7 2014 63
8 2015 27
9 2016 102
10 2017 324
11 2018 362
12 2019 509
13 2020 523
14 NA 91
Code language: HTML, XML (xml)
Oof. First things first, we have found our missing 91 shortages: they are missing start dates and so I end up filtering them out.
But look at the other counts… This says there are shortages reportedly ongoing since more than a decade ago. I don’t believe it, but who am I.
So just to be absolutely sure we’ve accounted correctly for the missing shortages (all 213 of them), let’s count all of the shortages that have empty start dates or end dates (actual or if not available, estimated) in the past:
dsc %>% filter(status == "active_confirmed") %>%
mutate(end_date = coalesce(as_date(actual_end_date), as_date(estimated_end_date), FAR_FUTURE)) %>%
filter(is.na(actual_start_date) | end_date < now()) %>%
count() %>% pull()
Code language: HTML, XML (xml)
[1] 213
Code language: JSON / JSON with Comments (json)
tl;dr: My count of active shortages differs from what shows up on https://drugshortagescanada.ca because I’ve filtered out reports that have empty start dates and end dates in the past.
Also, in the process of investigating this we found some ancient “active” shortages which tells us we probably found some long-forgotten shortage reports that likely need to be updated and closed.
In short, the DSC shortage reports are messy real world data with gaps and inconsistencies. There is certainly more that the DSC could be doing to clean up their dataset.
But, what should I do? I can alter my code so that it counts all “active_confirmed” records regardless of dates, but that only works for a single point in time… i.e. now. When I want to compute historical totals, then I need to use the dates provided (although I can ignore the estimated dates), and so my totals are going to be different than what is reported by the DSC. So be it, I guess.