repoRter.nih: a convenient R interface to the NIH RePORTER Project API

Introduction

The US National Institute of Health (NIH) received funding of approximately $42 billion in fiscal year 2022; $31 billion (72%) of this was awarded by the NIH in the form of research grant funding to hospitals, medical colleges, non-profits, businesses, and other organizations based in the U.S. and abroad.1 The NIH maintains a publicly available database called “RePORTER” to track this substantial flow of grant funding and makes it available to the public via a web-based query interface as well as an API.

“The NIH RePORTER APIs is designed to programmatically expose relevant scientific awards data from both NIH and non-NIH federal agencies for the consumption of project teams or external 3rd party applications to support reporting, data analysis, data integration or to satisfy other business needs as deemed pertinent.”
–NIH RePORTER v2 API Documentation

This data can have significant value for many audiences, including researchers, investors, industry, watchdogs/public advocates, and R users. But constructing queries and retrieving results programmatically involves some coding overhead which can be a challenge for those not familiar with RESTful APIs and JSON; it takes some effort even for those who are. The repoRter.nih package aims to simplify this task for the typical analyst scripting in R.

Getting Started

Installation

This package (latest stable release) can be installed from CRAN the usual way:

install.packages("repoRter.nih")

The current dev version can be installed from github, on the dev branch:

devtools::install_github('bikeactuary/repoRter.nih@dev')

I welcome R developers more capable than myself to collaborate on improving the source code, documentation, and unit testing in this package.

Basic Workflow

library(tibble)
library(repoRter.nih)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(scales)
library(tufte)

The make_req() method is used to generate a valid JSON request object. The req can subsequently be passed to the RePORTER Project API and results retrieved via the get_nih_data() method.

Generating the request:

# all projects funded by the Paycheck Protection Act, Coronavirus Response and
# Relief Act, and American Rescue Plan, in fiscal year 2021
req <- make_req(criteria =
                  list(fiscal_years = 2021,
                       covid_response = c("C4", "C5", "C6")))
#> This is your JSON payload:
#> {
#>     "criteria": {
#>         "fiscal_years": [
#>             2021
#>         ],
#>         "covid_response": [
#>             "C4",
#>             "C5",
#>             "C6"
#>         ],
#>         "use_relevance": false,
#>         "include_active_projects": false,
#>         "exclude_subprojects": false,
#>         "multi_pi_only": false,
#>         "newly_added_projects_only": false,
#>         "sub_project_only": false
#>     },
#>     "offset": 0,
#>     "limit": 500
#> }
#> 
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search

Sending the request and retrieving results:

res <- get_nih_data(req)
#> Retrieving first page of results (up to 500 records)
class(res)
#> [1] "tbl_df"     "tbl"        "data.frame"

A tibble is returned containing 43 columns. This data is not flat - several columns are nested data.frames and lists (of variable length vectors and data.frames of varying height).

res %>% glimpse(width = getOption("cli.width"))
#> Rows: 323
#> Columns: 44
#> $ appl_id                  <int> 10425707, 10255113, 10403857, 10258548, 10439178, 10446500, ~
#> $ subproject_id            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
#> $ fiscal_year              <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, ~
#> $ project_num              <chr> "3P20GM104417-08S1", "3P20GM104417-07S1", "3R01ES028615-07S1~
#> $ project_serial_num       <chr> "GM104417", "GM104417", "ES028615", "ES028615", "DC019579", ~
#> $ organization             <df[,17]> <data.frame[31 x 17]>
#> $ award_type               <chr> "3", "3", "3", "3", "7", "3", "1", "3", "3", NA, "1", "1~
#> $ activity_code            <chr> "P20", "P20", "R01", "R01", "U01", "R01", "R01", "U01", "U19~
#> $ award_amount             <int> 681188, 1115953, 300000, 1609765, 877287, 348242, 667277, 26~
#> $ is_active                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
#> $ project_num_split        <df[,7]> <data.frame[31 x 7]>
#> $ principal_investigators  <list> [<data.frame[1 x 7]>], [<data.frame[1 x 7]>], [<data.frame[2~
#> $ contact_pi_name          <chr> "ADAMS, ALEXANDRA K.", "ADAMS, ALEXANDRA K.", "AL-HENDY, ~
#> $ program_officers         <list> [<data.frame[1 x 4]>], [<data.frame[1 x 4]>], [<data.frame[~
#> $ agency_ic_admin          <df[,3]> <data.frame[31 x 3]>
#> $ agency_ic_fundings       <list> [<data.frame[1 x 5]>], [<data.frame[1 x 5]>], [<data.frame[1~
#> $ cong_dist                <chr> "MT-01", "MT-01", "IL-01", "IL-01", "MA-08", "MA-08", "MO-0~
#> $ spending_categories      <list> <NULL>, <44, 89, 176, 180, 4835, 5009, 5011, 246, 3641, 2~
#> $ project_start_date       <chr> "2021-09-01T12:09:00Z", "2020-11-17T12:11:00Z", "2020-11-11~
#> $ project_end_date         <chr> "2023-08-31T12:08:00Z", "2023-08-31T12:08:00Z", "2023-07-31T~
#> $ organization_type        <df[,3]> <data.frame[31 x 3]>
#> $ opportunity_number       <chr> "PAR-18-264", "PA-20-135", "PA-20-272", "PA-20-135", "RFA-O~
#> $ full_study_section       <df[,6]> <data.frame[31 x 6]>
#> $ award_notice_date        <chr> "2021-09-21T12:09:00Z", "2020-11-17T12:11:00Z", "2021-08-31T~
#> $ is_new                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
#> $ mechanism_code_dc        <chr> "RC", "RC", "RP", "RP", "RP", "RP", "RP", "RP", "RP", "NS~
#> $ core_project_num         <chr> "P20GM104417", "P20GM104417", "R01ES028615", "R01ES028615", ~
#> $ terms                    <chr> "<Risk Behaviors><Risky Behavior><at risk behavior><Rural~
#> $ pref_terms               <chr> "Affect;Age;Alaska Native;American Indians;Applications Gran~
#> $ abstract_text            <chr> "The mission of the Center for American Indian and Rural Hea~
#> $ project_title            <chr> "Center for American Indian and Rural Health Equity", "Cente~
#> $ phr_text                 <chr> "The mission of the Center for American Indian and Rural Hea~
#> $ spending_categories_desc <chr> NA, "American Indian or Alaska Native; Behavioral and Social~
#> $ agency_code              <chr> "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH~
#> $ covid_response           <list> "C6", "C4", "C6", "C4", "C4", "C6", "C6", "C6", "C6", "C6", ~
#> $ arra_funded              <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", ~
#> $ budget_start             <chr> "2021-09-01T12:09:00Z", "2020-11-17T12:11:00Z", "2021-09-01T~
#> $ budget_end               <chr> "2023-08-31T12:08:00Z", "2023-08-31T12:08:00Z", "2023-07-31T~
#> $ cfda_code                <chr> "859", "310", "113", "310", "310", "855", "855", "855", "855~
#> $ funding_mechanism        <chr> "Research Centers", "Research Centers", "Non-SBIR/STTR", "N~
#> $ direct_cost_amt          <int> 616778, 1006607, 297064, 1560464, 569403, 207287, 473684, 15~
#> $ indirect_cost_amt        <int> 64410, 109346, 2936, 49301, 307884, 140955, 193593, 116250, ~
#> $ project_detail_url       <chr> "https://reporter.nih.gov/project-details/10425707", "https:~
#> $ date_added               <chr> "2021-09-25T04:09:53Z", "2020-11-21T07:11:17Z", "2021-09-04T~

Criteria-Field Translation

A dataset (nih_fields) is provided with this package to assist in translating between field names used in the payload criteria, column names in the return data, and field names used in the include_fields, exclude_fields, and sort_field arguments.

data("nih_fields")
nih_fields %>% print
#> # A tibble: 43 x 5
#>    payload_name            response_name      include_name     return_class mod_ind
#>    <chr>                   <chr>              <chr>            <chr>          <int>
#>  1 appl_ids                appl_id            ApplId           integer            1
#>  2 <NA>                    subproject_id      SubprojectId     character          0
#>  3 fiscal_years            fiscal_year        FiscalYear       integer            1
#>  4 project_nums            project_num        ProjectNum       character          1
#>  5 serial_num              project_serial_num ProjectSerialNum character          1
#>  6 <NA>                    organization       Organization     data.frame         0
#>  7 award_types             award_type         AwardType        character          1
#>  8 activity_codes          activity_code      ActivityCode     character          1
#>  9 award_amount_range      award_amount       AwardAmount      integer            1
#> 10 include_active_projects is_active          IsActive         logical            1
#> # i 33 more rows

Some fields can not be used as filtering criteria - these will show NA in the payload_name column.

Generating Requests

Most of the detail (and function documentation) is around the many parameters available in RePORTER to filter/search project records. Let’s get into some of the capabilities.

Default Request

If no arguments are supplied, the default behavior of make_req() is to generate a request for all projects funded in fiscal_years = lubridate::year(Sys.Date()). Limiting requests to a single year is often necessary (depending on additional filtering criteria used) due to a RePORTER restriction that a maximum of 10K records may be returned from any result set. There are currently ~2.6M projects in the database going back to fiscal year 1985, and each fiscal year tends to have 70-100K projects, so the 10K limit can be restrictive to the user wanting a broad search.

req <- make_req()
#> This is your JSON payload:
#> {
#>     "criteria": {
#>         "fiscal_years": [
#>             2024
#>         ],
#>         "use_relevance": false,
#>         "include_active_projects": false,
#>         "exclude_subprojects": false,
#>         "multi_pi_only": false,
#>         "newly_added_projects_only": false,
#>         "sub_project_only": false
#>     },
#>     "offset": 0,
#>     "limit": 500
#> }
#> 
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search

The method prints a helpful message to the console in addition to returning the JSON. Set message = FALSE if you wish to suppress this message.

Limiting Data Retrieved

You can limit both the width and height of the result set retrieved from the API.

Fields

We probably will not need to fetch every field every time. The include_fields argument is provided to specify a limited set of fields to be returned. Alternatively, fields may be excluded using exclude_fields.

Records (projects)

This package provides the ability to retrieve only a limited number of result pages via the max_pages argument. This can be useful for developing/testing your queries (and for reducing time to render package documentation). Each page has a record count equal to limit - so setting max_pages = 5 with the default limit = 500 (the maximum permitted by RePORTER) in make_req() will result in up to 2,500 total records returned.

Ex. 1 - Limiting results and selecting fields

data("nih_fields")
fields <- nih_fields %>%
  filter(response_name %in% 
           c("appl_id", "subproject_id", "project_title", "fiscal_year",
             "award_amount", "is_active", "project_start_date")) %>%
  pull(include_name)

req <- make_req(include_fields = fields,
                limit = 500,
                message = FALSE) # default
res <- get_nih_data(query = req,
                    max_pages = 1)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 160 pages. Only partial results will be retrieved.

res %>% glimpse(width = getOption("cli.width"))
#> Rows: 500
#> Columns: 7
#> $ appl_id            <int> 10824314, 10914092, 10836918, 10767874, 10909117, 10946856, 108090~
#> $ subproject_id      <chr> NA, NA, NA, "6891", "8632", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ fiscal_year        <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, ~
#> $ award_amount       <int> 78892, 191250, 249366, 351922, 183746, 50000, 858535, 44553, 33305~
#> $ is_active          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,~
#> $ project_start_date <chr> "2022-05-01T12:05:00Z", "2023-08-25T12:08:00Z", "2024-02-20T12:02:~
#> $ project_title      <chr> "Elucidating novel molecular mechanisms of irisin-mediated effects~

Some Vanilla Criteria

Many criteria are passed as vectors within the criteria list argument. We will cover some of the most useful examples:

Ex. 3 - Coronavirus/Covid-19 research

We already provided one example of this search criteria above. Let’s mix it up and request all Covid response projects.

## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act,
## and American Rescue Plan, over all years
req <- make_req(criteria =
                  list(covid_response = c("All")),
                include_fields = nih_fields %>%
                  filter(payload_name %in% c("award_amount_range", "covid_response"))
                %>% pull(include_name))
#> This is your JSON payload:
#> {
#>     "criteria": {
#>         "covid_response": [
#>             "All"
#>         ],
#>         "use_relevance": false,
#>         "include_active_projects": false,
#>         "exclude_subprojects": false,
#>         "multi_pi_only": false,
#>         "newly_added_projects_only": false,
#>         "sub_project_only": false
#>     },
#>     "include_fields": [
#>         "AwardAmount",
#>         "CovidResponse"
#>     ],
#>     "offset": 0,
#>     "limit": 500
#> }
#> 
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search

res <- get_nih_data(req, max_pages = 1)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 9 pages. Only partial results will be retrieved.

Let’s inspect the result:

res$covid_response %>% class()
#> [1] "list"
res$covid_response[[1]]
#> [1] "Reg-CV"

covid_response is a nested list (with character vectors of variable length) within the return tibble. We can use flatten_result = TRUE here - elements of each vector will be collapsed to a single string delimited by “;”, massaging the list to a single character vector.

## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act,
## and American Rescue Plan, in fiscal year 2021
req <- make_req(criteria =
                  list(covid_response = c("All")),
                message = FALSE)

res <- get_nih_data(req,
                    flatten_result = TRUE)
unique(res$covid_response)
#> [1] "Reg-CV"    "CV"        "C3"        "C4"        "C6"        "C6;Reg-CV" "C5"       
#> [8] "C5;Reg-CV" "C4;Reg-CV"

Some projects are being funded from multiple sources. Summarizing all Covid-related project awards:

library(ggplot2)

res %>%
  left_join(covid_response_codes, by = "covid_response") %>%
  mutate(covid_code_desc = case_when(!is.na(fund_src) ~ paste0(covid_response, ": ", fund_src),
                                     TRUE ~ paste0(covid_response, " (Multiple)"))) %>%
  group_by(covid_code_desc) %>%
  summarise(total_awards = sum(award_amount) / 1e6) %>%
  ungroup() %>%
  arrange(desc(covid_code_desc)) %>%
  mutate(prop = total_awards / sum(total_awards),
         csum = cumsum(prop),
         ypos = csum - prop/2 ) %>%
  ggplot(aes(x = "", y = prop, fill = covid_code_desc)) +
  geom_bar(stat="identity") +
  geom_text_repel(aes(label =
                        paste0(dollar(total_awards,
                                      accuracy = 1,
                                      suffix = "M"),
                               "\n", percent(prop, accuracy = .01)),
                      y = ypos),
                  show.legend = FALSE,
                  nudge_x = .8,
                  size = 3, color = "grey25") +
  coord_polar(theta ="y") +
  theme_void() +
  theme(legend.position = "right",
        legend.title = element_text(colour = "grey25"),
        legend.text = element_text(colour="blue", size=6, 
                                   face="bold"),
        plot.title = element_text(color = "grey25"),
        plot.caption = element_text(size = 6)) +
  labs(caption = "Data Source: NIH RePORTER API v2") +
  ggtitle("Legislative Source for NIH Covid Response Project Funding")

A second dataset is provided to translate the covid_response codes; it includes both the long-form and a shorter version of the source name.

data("covid_response_codes")
covid_response_codes %>% print
#> # A tibble: 6 x 3
#>   covid_response funding_source                                                        fund_src
#>   <chr>          <chr>                                                                 <chr>   
#> 1 Reg-CV         NIH regular appropriations funding                                    NIH Reg~
#> 2 CV             Coronavirus Preparedness and Response Supplemental Appropriations Ac~ Coronav~
#> 3 C3             CARES Act (Coronavirus Aid, Relief, and Economic Security Act), 2020  CARES A~
#> 4 C4             Paycheck Protection Program and Health Care Enhancement Act, 2020     PPP & H~
#> 5 C5             Coronavirus Response and Relief Supplemental Appropriations Act, 2021 Coronav~
#> 6 C6             American Rescue Plan Act of 2021                                      America~

Some Rocky Road Criteria

Other criteria provide search and filtering capability on many of the nested data elements. These criteria are passed as lists and must include a value for each of the named elements within.

Large Result Sets

The RePORTER API provides no direct way to obtain complete result sets when searches yield over 10,000 records. get_nih_data() provides the return_meta argument which is defaulted to FALSE. When set to TRUE and combined with a little programming, you can easily obtain full result sets well beyond the 10K limit. One approach may be the following:

  1. Obtain a sample from your full result set by making the query you desire and calling get_nih_data() with max_pages = 1 (or some small number of pages); also set return_meta = TRUE in order to determine the total number of records in the full result set
  2. Calculate quantiles for the sample distribution of a column of your choice (e.g. award_amount)
  • Set the # of quantiles such that you can confidently infer that the number of records within each quantile range will contain <10K records within the full result set
  1. Iterate over your quantiles making separate requests, passing the endpoints of each quantile to award_amount_range criteria
  • Wait until the end to flatten the combined results since some columns may flatten differently on smaller individual result sets, causing problems in combining them after flattening
  1. Bind your list of results together
  2. Flatten the complete result set, if desired

Below is an implementation of the above logic:

all_res <- list()
for(y in 2017:2021) { ## five years to loop over, each year is ~80K records
  ## We only need the AwardAmount for quantiles
  req_sample <- make_req(criteria = list(fiscal_years = y),
                         include_fields = "AwardAmount")
  
  ## get a sample of the result set - 1000 records should be enough
  ## return the metadata
  res_sample <- get_nih_data(req_sample, max_pages = 2, return_meta = TRUE)
  
  paste0("There are ", res_sample$meta$total, " results for fiscal year ", y) %>%
    print()
  
  ## deciles of award amount - each decile should contain ~7,314.2 records, approximately
  qtiles <- res_sample$records %>% pull(award_amount) %>% quantile(na.rm = TRUE, probs = seq(.1, 1, .1))
  
  ## list for qtile results (full year)
  this_res <- list()
  ## for each qtile
  for (i in 1:length(qtiles)) {
    if (i == 1) {
      award_min <- 0
    } else {
      award_min <- ceiling(qtiles[i-1])+.01
    }
    if (i == length(qtiles)) {
      award_max <- 1e9 ## arbitrarily huge
    } else {
      award_max <- ceiling(qtiles[i])
    }
    req <- make_req(criteria = list(fiscal_years = y,
                                    award_amount_range = list(min_amount = award_min,
                                                              max_amount = award_max)))
    ## result set for quantile
    this_res[[i]] <- get_nih_data(req, flatten_result = FALSE)
  }
  ## list of result sets for each year
  yr_res[[y %>% as.character()]] <- this_res
}

## shape it up
all_res <- unlist(yr_res, recursive = FALSE) %>%
  bind_rows() %>%
  flatten(recursive = FALSE) %>%
  clean_names()

## pull out everything that is flat
flat_columns <- all_res %>%
  select_if(is.atomic)

## everything that isn't
annoying_columns <- all_res %>%
  select_if(!is.atomic)

Note that using award_amount for this purpose will omit records with missing values. If you need these included, you may consider similar logic applied to an alternative field such as award_notice_date.

Additional Resources

  • The RePORTER web interface and official API documentation are useful for getting familiar with available search parameters
  • …and the homepage with further examples/documentation is here
  • Information on NIH study sections, IRGs, etc. is here
  • h/t to Chris whose code on github was all I could find existing in R and served as a starting point for this work

Session Information

The version number of R and packages loaded for generating the vignette were:

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#>  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tufte_0.13         scales_1.3.0       dplyr_1.1.4        ggrepel_0.9.6     
#> [5] ggplot2_3.5.1      repoRter.nih_0.1.4 tibble_3.2.1      
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.4     digest_0.6.37    
#>  [6] magrittr_2.0.3    evaluate_1.0.1    grid_4.4.1        timechange_0.3.0  fastmap_1.2.0    
#> [11] jsonlite_1.8.9    httr_1.4.7        purrr_1.0.2       fansi_1.0.6       jquerylib_0.1.4  
#> [16] cli_3.6.3         rlang_1.1.4       crayon_1.5.3      munsell_0.5.1     withr_3.0.2      
#> [21] cachem_1.1.0      yaml_2.3.10       tools_4.4.1       colorspace_2.1-1  curl_5.2.3       
#> [26] assertthat_0.2.1  buildtools_1.0.0  vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
#> [31] lubridate_1.9.3   snakecase_0.11.1  stringr_1.5.1     janitor_2.2.0     pkgconfig_2.0.3  
#> [36] pillar_1.9.0      bslib_0.8.0       gtable_0.3.6      Rcpp_1.0.13       glue_1.8.0       
#> [41] xfun_0.49         tidyselect_1.2.1  highr_0.11        sys_3.4.3         knitr_1.48       
#> [46] htmltools_0.5.8.1 rmarkdown_2.28    maketools_1.3.1   compiler_4.4.1