--- title: "repoRter.nih: a convenient R interface to the NIH RePORTER Project API" author: - Michael Barr, ACAS, MAAA, CPCU output: pdf_document: keep_md: yes keep_tex: yes pandoc_args: --listings html_document: highlight: tango theme: flatly header-includes: - \UseRawInputEncoding - \usepackage{xcolor} - \definecolor{airforceblue}{rgb}{0.36, 0.54, 0.66} - \definecolor{cerulean}{rgb}{0.0, 0.48, 0.65} - \definecolor{burntorange}{rgb}{0.8, 0.33, 0.0} - \usepackage{hyperref} - \hypersetup{colorlinks=true,linkcolor=airforceblue,citecolor=cerulean,urlcolor=burntorange} - \lstset{ backgroundcolor=\color[RGB]{248,248,248}, basewidth=0.5em, basicstyle=\small\ttfamily, breakatwhitespace=true, breakautoindent=true, breaklines=true, captionpos=b, columns=flexible, commentstyle=\color[rgb]{0.56,0.35,0.01}\itshape, escapeinside={\%*}{*)}, frame=tb, keepspaces=true, keywordstyle=\color[rgb]{0.13,0.29,0.53}\bfseries, linewidth=\textwidth, numberstyle=\footnotesize, showspaces=false, showstringspaces=false, showtabs=false, stringstyle=\color[rgb]{0.31,0.60,0.02}, tabsize=2, } vignette: > %\VignetteIndexEntry{repoRter.nih: a convenient R interface to the NIH RePORTER Project API} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", width = 95, fig.dpi = 96 ) opts.old <- options() options(width = 95, cli.unicode = FALSE, cli.width = 95) ## confirm RePORTER is currently available before building this, for CRAN r <- httr::POST("https://api.reporter.nih.gov/v2/projects/search", httr::accept("application/json"), httr::content_type_json(), body = "{\"criteria\":{\"foa\":[\"RFA-NS-19-036\"]},\"include_fields\":[\"ApplId\"],\"offset\":0,\"limit\":10,\"sort_field\":\"appl_id\",\"sort_order\":\"desc\"}", encode = "raw" ) if (r$status_code != 200) knitr::knit_exit() ``` ```{r, echo = FALSE, message = FALSE, fig.width=7, fig.height=4, fig.align='center'} knitr::include_graphics("covid_plot.png") ``` # Introduction The US National Institute of Health (NIH) received funding of approximately \$42 billion in fiscal year 2022; \$31 billion (72%) of this was awarded by the NIH in the form of research grant funding to hospitals, medical colleges, non-profits, businesses, and other organizations based in the U.S. and abroad.^[https://nexus.od.nih.gov/all/2021/04/21/fy-2020-by-the-numbers-extramural-investments-in-research] The NIH maintains a publicly available database called "RePORTER" to track this substantial flow of grant funding and makes it available to the public via [a web-based query interface](https://reporter.nih.gov/) as well as an API. > "The NIH RePORTER APIs is designed to programmatically expose relevant scientific awards data from both NIH and non-NIH federal agencies for the consumption of project teams or external 3rd party applications to support reporting, data analysis, data integration or to satisfy other business needs as deemed pertinent." >`r tufte::quote_footer('--NIH RePORTER v2 API Documentation')` This data can have significant value for many audiences, including researchers, investors, industry, watchdogs/public advocates, and R users. But constructing queries and retrieving results programmatically involves some coding overhead which can be a challenge for those not familiar with RESTful APIs and JSON; it takes some effort even for those who are. The `repoRter.nih` package aims to simplify this task for the typical analyst scripting in R. # Getting Started ## Installation This package (latest stable release) can be installed from CRAN the usual way: ```{r, eval = FALSE} install.packages("repoRter.nih") ``` The current dev version can be installed from github, on the `dev` branch: ```{r, eval = FALSE} devtools::install_github('bikeactuary/repoRter.nih@dev') ``` I welcome R developers more capable than myself to collaborate on improving the source code, documentation, and unit testing in this package. # Basic Workflow ```{r, message = FALSE} library(tibble) library(repoRter.nih) library(ggplot2) library(ggrepel) library(dplyr) library(scales) library(tufte) ``` The `make_req()` method is used to generate a valid JSON request object. The req can subsequently be passed to the RePORTER Project API and results retrieved via the `get_nih_data()` method. Generating the request: ```{r} # all projects funded by the Paycheck Protection Act, Coronavirus Response and # Relief Act, and American Rescue Plan, in fiscal year 2021 req <- make_req(criteria = list(fiscal_years = 2021, covid_response = c("C4", "C5", "C6"))) ``` Sending the request and retrieving results: ```{r} res <- get_nih_data(req) class(res) ``` A tibble is returned containing 43 columns. This data is not flat - several columns are nested `data.frame`s and `list`s (of variable length vectors and `data.frame`s of varying height). ```{r} res %>% glimpse(width = getOption("cli.width")) ``` # Criteria-Field Translation A dataset (`nih_fields`) is provided with this package to assist in translating between field names used in the payload `criteria`, column names in the return data, and field names used in the `include_fields`, `exclude_fields`, and `sort_field` arguments. ```{r} data("nih_fields") nih_fields %>% print ``` Some fields can not be used as filtering `criteria` - these will show `NA` in the `payload_name` column. # Generating Requests Most of the detail (and function documentation) is around the many parameters available in RePORTER to filter/search project records. Let's get into some of the capabilities. ## Default Request If no arguments are supplied, the default behavior of `make_req()` is to generate a request for all projects funded in `fiscal_years = lubridate::year(Sys.Date())`. Limiting requests to a single year is often necessary (depending on additional filtering criteria used) due to a RePORTER restriction that a maximum of 10K records may be returned from any result set. There are currently ~2.6M projects in the database going back to fiscal year 1985, and each fiscal year tends to have 70-100K projects, so the 10K limit can be restrictive to the user wanting a broad search. ```{r} req <- make_req() ``` The method prints a helpful message to the console in addition to returning the JSON. Set `message = FALSE` if you wish to suppress this message. ## Limiting Data Retrieved You can limit both the width and height of the result set retrieved from the API. ### Fields We probably will not need to fetch every field every time. The `include_fields` argument is provided to specify a limited set of fields to be returned. Alternatively, fields may be excluded using `exclude_fields`. ### Records (projects) This package provides the ability to retrieve only a limited number of result pages via the `max_pages` argument. This can be useful for developing/testing your queries (and for reducing time to render package documentation). Each page has a record count equal to `limit` - so setting `max_pages = 5` with the default `limit = 500` (the maximum permitted by RePORTER) in `make_req()` will result in up to 2,500 total records returned. ### Ex. 1 - Limiting results and selecting fields ```{r} data("nih_fields") fields <- nih_fields %>% filter(response_name %in% c("appl_id", "subproject_id", "project_title", "fiscal_year", "award_amount", "is_active", "project_start_date")) %>% pull(include_name) req <- make_req(include_fields = fields, limit = 500, message = FALSE) # default res <- get_nih_data(query = req, max_pages = 1) res %>% glimpse(width = getOption("cli.width")) ``` ## Some Vanilla Criteria Many criteria are passed as vectors within the `criteria` list argument. We will cover some of the most useful examples: ### Ex. 2 - Organization search We can refine our query results by providing filtering criteria to `make_req()`, and by extension to the API. Suppose we want all currently active projects, funded in fiscal years 2017 through 2021, with a specific organization in mind (though we don't know exactly how its name will appear in RePORTER): ```{r} req <- make_req(criteria = list( fiscal_years = 2010:2011, include_active_projects = TRUE, org_names = c("Yale", "New Haven") ), include_fields = c("Organization", "FiscalYear", "AwardAmount"), message = FALSE) ``` Here we are asking for any orgs containing the strings "yale" or "new haven" (ignoring case) - there are implied wildcards on either end of the strings we provide. This is the same as `org_name LIKE '%yale%' OR org_name LIKE '%new haven%'` in a SQL WHERE clause. ```{r} res <- get_nih_data(req, max_pages = 1) res %>% glimpse(width = getOption("cli.width")) ``` Notice the column `organization` is a nested data frame - it has 17 columns and always a single record. Setting `flatten_result = TRUE` in the call to `get_nih_data()` will flatten all such return fields, prefixing the original field name and returning with clean names (see `janitor::clean_names()`). ```{r} res <- get_nih_data(req, max_pages = 1, flatten_result = TRUE) res %>% glimpse(width = getOption("cli.width")) ``` Most users will prefer the flattened format above. It looks like Yale is busy, but it is not the only org matching our search. ```{r, echo = FALSE} evl <- class(res)[1] == "tbl_df" ``` ```{r, eval=evl} res %>% group_by(organization_org_name) %>% summarise(project_count = n()) ``` The `org_names_exact_match` criteria can be used as an alternative when we know the exact org name as it appears in RePORTER, if we want only that org's projects returned. ### Ex. 3 - Geographic search We can also filter projects by the geographic location (country/state/city) of the applicant organization. ```{r} ## A valid request but probably not what we want req <- make_req(criteria = list( fiscal_years = 2010:2011, include_active_projects = TRUE, org_cities = "New Haven", org_states = "WY" ), include_fields = c("Organization", "FiscalYear", "AwardAmount"), message = FALSE ## suppress printed message ) res <- get_nih_data(req, max_pages = 5, flatten_result = TRUE) ``` Multiple criteria are usually connected by logical "AND" - there are no orgs based in the city of New Haven in Wyoming state (because it doesn't exist.) ```{r} req <- make_req(criteria = list( fiscal_years = 2015, include_active_projects = TRUE, org_states = "WY" ), include_fields = c("ApplId", "Organization", "FiscalYear", "AwardAmount"), sort_field = "AwardAmount", sort_order = "desc", message = FALSE) res <- get_nih_data(req, flatten_result = TRUE) res %>% glimpse(width = getOption("cli.width")) ``` Why are there projects from more recent years than 2015? Because the `include_active_projects` flag adds in active projects that match all criteria aside from `fiscal_years` (this appears to be the intended behavior by RePORTER). ### Ex. 3 - Coronavirus/Covid-19 research We already provided one example of this search criteria above. Let's mix it up and request all Covid response projects. ```{r} ## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act, ## and American Rescue Plan, over all years req <- make_req(criteria = list(covid_response = c("All")), include_fields = nih_fields %>% filter(payload_name %in% c("award_amount_range", "covid_response")) %>% pull(include_name)) res <- get_nih_data(req, max_pages = 1) ``` ```{r, echo=FALSE} evl <- class(res)[1] == "tbl_df" ``` Let's inspect the result: ```{r, eval=evl} res$covid_response %>% class() res$covid_response[[1]] ``` `covid_response` is a nested list (with character vectors of variable length) within the return tibble. We can use `flatten_result = TRUE` here - elements of each vector will be collapsed to a single string delimited by ";", massaging the list to a single character vector. ```{r covid, eval=FALSE} ## all projects funded by the Paycheck Protection Act, Coronavirus Response and Relief Act, ## and American Rescue Plan, in fiscal year 2021 req <- make_req(criteria = list(covid_response = c("All")), message = FALSE) res <- get_nih_data(req, flatten_result = TRUE) ``` ```{r, eval=FALSE} unique(res$covid_response) ``` ```{r, echo=FALSE} res <- readRDS("cov_res.RDS") unique(res$covid_response) ``` Some projects are being funded from multiple sources. Summarizing all Covid-related project awards: ```{r, eval=FALSE} library(ggplot2) res %>% left_join(covid_response_codes, by = "covid_response") %>% mutate(covid_code_desc = case_when(!is.na(fund_src) ~ paste0(covid_response, ": ", fund_src), TRUE ~ paste0(covid_response, " (Multiple)"))) %>% group_by(covid_code_desc) %>% summarise(total_awards = sum(award_amount) / 1e6) %>% ungroup() %>% arrange(desc(covid_code_desc)) %>% mutate(prop = total_awards / sum(total_awards), csum = cumsum(prop), ypos = csum - prop/2 ) %>% ggplot(aes(x = "", y = prop, fill = covid_code_desc)) + geom_bar(stat="identity") + geom_text_repel(aes(label = paste0(dollar(total_awards, accuracy = 1, suffix = "M"), "\n", percent(prop, accuracy = .01)), y = ypos), show.legend = FALSE, nudge_x = .8, size = 3, color = "grey25") + coord_polar(theta ="y") + theme_void() + theme(legend.position = "right", legend.title = element_text(colour = "grey25"), legend.text = element_text(colour="blue", size=6, face="bold"), plot.title = element_text(color = "grey25"), plot.caption = element_text(size = 6)) + labs(caption = "Data Source: NIH RePORTER API v2") + ggtitle("Legislative Source for NIH Covid Response Project Funding") ``` ```{r, echo = FALSE, message = FALSE, fig.width=7, fig.height=4, fig.align='center'} knitr::include_graphics("covid_plot.png") ``` A second dataset is provided to translate the `covid_response` codes; it includes both the long-form and a shorter version of the source name. ```{r} data("covid_response_codes") covid_response_codes %>% print ``` # Some Rocky Road Criteria Other criteria provide search and filtering capability on many of the nested data elements. These criteria are passed as lists and must include a value for each of the named elements within. ### Ex. 4 - Principal Investigator/Officer name search The `pi_names` and `po_names` criteria allow the user to search for projects based on the first and last names of Principal Investigators and Principal Officers assigned. Each of these criteria must be provided as a list with three named character vector elements: `first_name`, `last_name`, and `any_name`. _Even if you only want to search on one of these name fields, you must provide the remaining elements as an empty string._ We will demonstrate with a search on PI name: ```{r} ## projects funded in 2021 where the principal investigator first name ## is "Michael" or begins with "Jo" req <- make_req(criteria = list(fiscal_years = 2021, pi_names = list(first_name = c("Michael", "Jo*"), last_name = c(""), # must specify all pi_names elements always any_name = character(1))), include_fields = nih_fields %>% filter(payload_name == "pi_names") %>% pull(include_name), message = FALSE) res <- get_nih_data(req, max_pages = 1, flatten_result = TRUE) res %>% glimpse(width = getOption("cli.width")) ``` Here we searched for any projects with a PI first-named "Michael" or beginning with "Jo" - the "*" is a wildcard operator. Note that the first column in the return is a list of data frames of _variable height_ (not a nested `data.frame`) - we leave such returned elements to the user to handle extraction/formatting - flattening is only performed for lists of atomic vectors and nested data frames. ### Ex. 5 - Advanced text search RePORTER allows users to search the project title, abstract, and tags for specific terms or phrases. You can access this capability with the `advanced_text_search` criteria - a named list with three elements: - `operator` may be either "and", "or", or "advanced" - and/or will specify the logical operator connecting multiple search terms. "advanced" allows the user to pass a boolean search string directly; - `search_field` can be any or multiple of "terms", "abstract", "projecttitle." To search all items, specify "all" or "" (a length 1 vector with an empty string); - `search_text` may be either a length 1 character vector of space-delimited search terms (when using "and" or "or" for the operator argument - the logical operator is inserted between all search terms); or it may be a boolean search string (when specifying "advanced" for the operator argument). ```{r} ## using advanced_text_search with boolean search string req <- make_req(criteria = list(advanced_text_search = list(operator = "advanced", search_field = c("terms", "abstract"), search_text = "(head AND trauma) OR \"brain damage\" AND NOT \"psychological\"")), include_fields = c("ProjectTitle", "AbstractText", "Terms") ) res <- get_nih_data(req, max_pages = 1) ``` Let's inspect the fields we searched from one of these results: ```{r, echo = FALSE} evl <- class(res)[1] == "tbl_df" ``` ```{r, eval=evl} one_rec <- res %>% slice(42) %>% mutate(abstract_text = gsub("[\r\n]", " ", abstract_text)) one_rec %>% pull(project_title) %>% print ``` ```{r, eval=evl} ## substr to avoid LaTeX error exceeding char limit one_rec %>% pull(abstract_text) %>% substr(1, 85) %>% print ``` ```{r, eval=evl} one_rec %>% pull(terms) %>% substr(1, 85) %>% print ``` # Large Result Sets The RePORTER API provides no direct way to obtain complete result sets when searches yield over 10,000 records. `get_nih_data()` provides the `return_meta` argument which is defaulted to `FALSE`. When set to `TRUE` and combined with a little programming, you can easily obtain full result sets well beyond the 10K limit. One approach may be the following: 1. Obtain a sample from your full result set by making the query you desire and calling `get_nih_data()` with `max_pages = 1` (or some small number of pages); also set `return_meta = TRUE` in order to determine the total number of records in the full result set 2. Calculate quantiles for the sample distribution of a column of your choice (e.g. `award_amount`) + Set the # of quantiles such that you can confidently infer that the number of records within each quantile range will contain <10K records *within the full result set* 3. Iterate over your quantiles making separate requests, passing the endpoints of each quantile to `award_amount_range` criteria + Wait until the end to flatten the combined results since some columns may flatten differently on smaller individual result sets, causing problems in combining them after flattening 4. Bind your list of results together 5. Flatten the complete result set, if desired Below is an implementation of the above logic: ```{r, eval = FALSE} all_res <- list() for(y in 2017:2021) { ## five years to loop over, each year is ~80K records ## We only need the AwardAmount for quantiles req_sample <- make_req(criteria = list(fiscal_years = y), include_fields = "AwardAmount") ## get a sample of the result set - 1000 records should be enough ## return the metadata res_sample <- get_nih_data(req_sample, max_pages = 2, return_meta = TRUE) paste0("There are ", res_sample$meta$total, " results for fiscal year ", y) %>% print() ## deciles of award amount - each decile should contain ~7,314.2 records, approximately qtiles <- res_sample$records %>% pull(award_amount) %>% quantile(na.rm = TRUE, probs = seq(.1, 1, .1)) ## list for qtile results (full year) this_res <- list() ## for each qtile for (i in 1:length(qtiles)) { if (i == 1) { award_min <- 0 } else { award_min <- ceiling(qtiles[i-1])+.01 } if (i == length(qtiles)) { award_max <- 1e9 ## arbitrarily huge } else { award_max <- ceiling(qtiles[i]) } req <- make_req(criteria = list(fiscal_years = y, award_amount_range = list(min_amount = award_min, max_amount = award_max))) ## result set for quantile this_res[[i]] <- get_nih_data(req, flatten_result = FALSE) } ## list of result sets for each year yr_res[[y %>% as.character()]] <- this_res } ## shape it up all_res <- unlist(yr_res, recursive = FALSE) %>% bind_rows() %>% flatten(recursive = FALSE) %>% clean_names() ## pull out everything that is flat flat_columns <- all_res %>% select_if(is.atomic) ## everything that isn't annoying_columns <- all_res %>% select_if(!is.atomic) ``` Note that using `award_amount` for this purpose will omit records with missing values. If you need these included, you may consider similar logic applied to an alternative field such as `award_notice_date`. # Additional Resources - The [RePORTER web interface](https://reporter.nih.gov/advanced-search) and [official API documentation](https://api.reporter.nih.gov/documents/Data%20Elements%20for%20RePORTER%20Project%20API%20v2.pdf) are useful for getting familiar with available search parameters - ...and the homepage with further examples/documentation is [here](https://api.reporter.nih.gov/) - Information on NIH study sections, IRGs, etc. is [here](https://public.csr.nih.gov/StudySections) - h/t to Chris whose [code on github](https://github.com/christopherBelter/nih_reporter_api) was all I could find existing in R and served as a starting point for this work # Session Information The version number of R and packages loaded for generating the vignette were: ```{r} sessionInfo() ``` ```{r, include=FALSE} options(opts.old) ```