The US National Institute of Health (NIH) received funding of approximately $42 billion in fiscal year 2022; $31 billion (72%) of this was awarded by the NIH in the form of research grant funding to hospitals, medical colleges, non-profits, businesses, and other organizations based in the U.S. and abroad.1 The NIH maintains a publicly available database called “RePORTER” to track this substantial flow of grant funding and makes it available to the public via a web-based query interface as well as an API.
“The NIH RePORTER APIs is designed to programmatically expose relevant scientific awards data from both NIH and non-NIH federal agencies for the consumption of project teams or external 3rd party applications to support reporting, data analysis, data integration or to satisfy other business needs as deemed pertinent.”
This data can have significant value for many audiences, including
researchers, investors, industry, watchdogs/public advocates, and R
users. But constructing queries and retrieving results programmatically
involves some coding overhead which can be a challenge for those not
familiar with RESTful APIs and JSON; it takes some effort even for those
who are. The repoRter.nih
package aims to simplify this
task for the typical analyst scripting in R.
This package (latest stable release) can be installed from CRAN the usual way:
The current dev version can be installed from github, on the
dev
branch:
I welcome R developers more capable than myself to collaborate on improving the source code, documentation, and unit testing in this package.
library(tibble)
library(repoRter.nih)
library(ggplot2)
library(ggrepel)
library(dplyr)
library(scales)
library(tufte)
The make_req()
method is used to generate a valid JSON
request object. The req can subsequently be passed to the RePORTER
Project API and results retrieved via the get_nih_data()
method.
Generating the request:
# all projects funded by the Paycheck Protection Act, Coronavirus Response and
# Relief Act, and American Rescue Plan, in fiscal year 2021
req <- make_req(criteria =
list(fiscal_years = 2021,
covid_response = c("C4", "C5", "C6")))
#> This is your JSON payload:
#> {
#> "criteria": {
#> "fiscal_years": [
#> 2021
#> ],
#> "covid_response": [
#> "C4",
#> "C5",
#> "C6"
#> ],
#> "use_relevance": false,
#> "include_active_projects": false,
#> "exclude_subprojects": false,
#> "multi_pi_only": false,
#> "newly_added_projects_only": false,
#> "sub_project_only": false
#> },
#> "offset": 0,
#> "limit": 500
#> }
#>
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search
Sending the request and retrieving results:
res <- get_nih_data(req)
#> Retrieving first page of results (up to 500 records)
class(res)
#> [1] "tbl_df" "tbl" "data.frame"
A tibble is returned containing 43 columns. This data is not flat -
several columns are nested data.frame
s and
list
s (of variable length vectors and
data.frame
s of varying height).
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 323
#> Columns: 44
#> $ appl_id <int> 10425707, 10255113, 10403857, 10258548, 10439178, 10446500, ~
#> $ subproject_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
#> $ fiscal_year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, ~
#> $ project_num <chr> "3P20GM104417-08S1", "3P20GM104417-07S1", "3R01ES028615-07S1~
#> $ project_serial_num <chr> "GM104417", "GM104417", "ES028615", "ES028615", "DC019579", ~
#> $ organization <df[,17]> <data.frame[31 x 17]>
#> $ award_type <chr> "3", "3", "3", "3", "7", "3", "1", "3", "3", NA, "1", "1~
#> $ activity_code <chr> "P20", "P20", "R01", "R01", "U01", "R01", "R01", "U01", "U19~
#> $ award_amount <int> 681188, 1115953, 300000, 1609765, 877287, 348242, 667277, 26~
#> $ is_active <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
#> $ project_num_split <df[,7]> <data.frame[31 x 7]>
#> $ principal_investigators <list> [<data.frame[1 x 7]>], [<data.frame[1 x 7]>], [<data.frame[2~
#> $ contact_pi_name <chr> "ADAMS, ALEXANDRA K.", "ADAMS, ALEXANDRA K.", "AL-HENDY, ~
#> $ program_officers <list> [<data.frame[1 x 4]>], [<data.frame[1 x 4]>], [<data.frame[~
#> $ agency_ic_admin <df[,3]> <data.frame[31 x 3]>
#> $ agency_ic_fundings <list> [<data.frame[1 x 5]>], [<data.frame[1 x 5]>], [<data.frame[1~
#> $ cong_dist <chr> "MT-01", "MT-01", "IL-01", "IL-01", "MA-08", "MA-08", "MO-0~
#> $ spending_categories <list> <NULL>, <44, 89, 176, 180, 4835, 5009, 5011, 246, 3641, 2~
#> $ project_start_date <chr> "2021-09-01T12:09:00Z", "2020-11-17T12:11:00Z", "2020-11-11~
#> $ project_end_date <chr> "2023-08-31T12:08:00Z", "2023-08-31T12:08:00Z", "2023-07-31T~
#> $ organization_type <df[,3]> <data.frame[31 x 3]>
#> $ opportunity_number <chr> "PAR-18-264", "PA-20-135", "PA-20-272", "PA-20-135", "RFA-O~
#> $ full_study_section <df[,6]> <data.frame[31 x 6]>
#> $ award_notice_date <chr> "2021-09-21T12:09:00Z", "2020-11-17T12:11:00Z", "2021-08-31T~
#> $ is_new <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
#> $ mechanism_code_dc <chr> "RC", "RC", "RP", "RP", "RP", "RP", "RP", "RP", "RP", "NS~
#> $ core_project_num <chr> "P20GM104417", "P20GM104417", "R01ES028615", "R01ES028615", ~
#> $ terms <chr> "<Risk Behaviors><Risky Behavior><at risk behavior><Rural~
#> $ pref_terms <chr> "Affect;Age;Alaska Native;American Indians;Applications Gran~
#> $ abstract_text <chr> "The mission of the Center for American Indian and Rural Hea~
#> $ project_title <chr> "Center for American Indian and Rural Health Equity", "Cente~
#> $ phr_text <chr> "The mission of the Center for American Indian and Rural Hea~
#> $ spending_categories_desc <chr> NA, "American Indian or Alaska Native; Behavioral and Social~
#> $ agency_code <chr> "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH", "NIH~
#> $ covid_response <list> "C6", "C4", "C6", "C4", "C4", "C6", "C6", "C6", "C6", "C6", ~
#> $ arra_funded <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", ~
#> $ budget_start <chr> "2021-09-01T12:09:00Z", "2020-11-17T12:11:00Z", "2021-09-01T~
#> $ budget_end <chr> "2023-08-31T12:08:00Z", "2023-08-31T12:08:00Z", "2023-07-31T~
#> $ cfda_code <chr> "859", "310", "113", "310", "310", "855", "855", "855", "855~
#> $ funding_mechanism <chr> "Research Centers", "Research Centers", "Non-SBIR/STTR", "N~
#> $ direct_cost_amt <int> 616778, 1006607, 297064, 1560464, 569403, 207287, 473684, 15~
#> $ indirect_cost_amt <int> 64410, 109346, 2936, 49301, 307884, 140955, 193593, 116250, ~
#> $ project_detail_url <chr> "https://reporter.nih.gov/project-details/10425707", "https:~
#> $ date_added <chr> "2021-09-25T04:09:53Z", "2020-11-21T07:11:17Z", "2021-09-04T~
A dataset (nih_fields
) is provided with this package to
assist in translating between field names used in the payload
criteria
, column names in the return data, and field names
used in the include_fields
, exclude_fields
,
and sort_field
arguments.
data("nih_fields")
nih_fields %>% print
#> # A tibble: 43 x 5
#> payload_name response_name include_name return_class mod_ind
#> <chr> <chr> <chr> <chr> <int>
#> 1 appl_ids appl_id ApplId integer 1
#> 2 <NA> subproject_id SubprojectId character 0
#> 3 fiscal_years fiscal_year FiscalYear integer 1
#> 4 project_nums project_num ProjectNum character 1
#> 5 serial_num project_serial_num ProjectSerialNum character 1
#> 6 <NA> organization Organization data.frame 0
#> 7 award_types award_type AwardType character 1
#> 8 activity_codes activity_code ActivityCode character 1
#> 9 award_amount_range award_amount AwardAmount integer 1
#> 10 include_active_projects is_active IsActive logical 1
#> # i 33 more rows
Some fields can not be used as filtering criteria
-
these will show NA
in the payload_name
column.
Most of the detail (and function documentation) is around the many parameters available in RePORTER to filter/search project records. Let’s get into some of the capabilities.
If no arguments are supplied, the default behavior of
make_req()
is to generate a request for all projects funded
in fiscal_years = lubridate::year(Sys.Date())
. Limiting
requests to a single year is often necessary (depending on additional
filtering criteria used) due to a RePORTER restriction that a maximum of
10K records may be returned from any result set. There are currently
~2.6M projects in the database going back to fiscal year 1985, and each
fiscal year tends to have 70-100K projects, so the 10K limit can be
restrictive to the user wanting a broad search.
req <- make_req()
#> This is your JSON payload:
#> {
#> "criteria": {
#> "fiscal_years": [
#> 2024
#> ],
#> "use_relevance": false,
#> "include_active_projects": false,
#> "exclude_subprojects": false,
#> "multi_pi_only": false,
#> "newly_added_projects_only": false,
#> "sub_project_only": false
#> },
#> "offset": 0,
#> "limit": 500
#> }
#>
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search
The method prints a helpful message to the console in addition to
returning the JSON. Set message = FALSE
if you wish to
suppress this message.
You can limit both the width and height of the result set retrieved from the API.
We probably will not need to fetch every field every time. The
include_fields
argument is provided to specify a limited
set of fields to be returned. Alternatively, fields may be excluded
using exclude_fields
.
This package provides the ability to retrieve only a limited number
of result pages via the max_pages
argument. This can be
useful for developing/testing your queries (and for reducing time to
render package documentation). Each page has a record count equal to
limit
- so setting max_pages = 5
with the
default limit = 500
(the maximum permitted by RePORTER) in
make_req()
will result in up to 2,500 total records
returned.
data("nih_fields")
fields <- nih_fields %>%
filter(response_name %in%
c("appl_id", "subproject_id", "project_title", "fiscal_year",
"award_amount", "is_active", "project_start_date")) %>%
pull(include_name)
req <- make_req(include_fields = fields,
limit = 500,
message = FALSE) # default
res <- get_nih_data(query = req,
max_pages = 1)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 160 pages. Only partial results will be retrieved.
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 500
#> Columns: 7
#> $ appl_id <int> 10824314, 10914092, 10836918, 10767874, 10909117, 10946856, 108090~
#> $ subproject_id <chr> NA, NA, NA, "6891", "8632", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ fiscal_year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, ~
#> $ award_amount <int> 78892, 191250, 249366, 351922, 183746, 50000, 858535, 44553, 33305~
#> $ is_active <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,~
#> $ project_start_date <chr> "2022-05-01T12:05:00Z", "2023-08-25T12:08:00Z", "2024-02-20T12:02:~
#> $ project_title <chr> "Elucidating novel molecular mechanisms of irisin-mediated effects~
Many criteria are passed as vectors within the criteria
list argument. We will cover some of the most useful examples:
We can refine our query results by providing filtering criteria to
make_req()
, and by extension to the API. Suppose we want
all currently active projects, funded in fiscal years 2017 through 2021,
with a specific organization in mind (though we don’t know exactly how
its name will appear in RePORTER):
req <- make_req(criteria =
list(
fiscal_years = 2010:2011,
include_active_projects = TRUE,
org_names = c("Yale", "New Haven")
),
include_fields = c("Organization", "FiscalYear", "AwardAmount"),
message = FALSE)
Here we are asking for any orgs containing the strings “yale” or “new
haven” (ignoring case) - there are implied wildcards on either end of
the strings we provide. This is the same as
org_name LIKE '%yale%' OR org_name LIKE '%new haven%'
in a
SQL WHERE clause.
res <- get_nih_data(req, max_pages = 1)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 8 pages. Only partial results will be retrieved.
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 500
#> Columns: 3
#> $ fiscal_year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, ~
#> $ organization <df[,17]> <data.frame[31 x 17]>
#> $ award_amount <int> 68350, 241530, 452542, 522742, 742946, 617340, 605948, 769465, 33892~
Notice the column organization
is a nested data frame -
it has 17 columns and always a single record. Setting
flatten_result = TRUE
in the call to
get_nih_data()
will flatten all such return fields,
prefixing the original field name and returning with clean names (see
janitor::clean_names()
).
res <- get_nih_data(req,
max_pages = 1,
flatten_result = TRUE)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 8 pages. Only partial results will be retrieved.
#> Warning: There was 1 warning in `mutate()`.
#> i In argument: `across(...)`.
#> Caused by warning:
#> ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
#> i Please supply `.cols` instead.
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 500
#> Columns: 19
#> $ fiscal_year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, ~
#> $ award_amount <int> 68350, 241530, 452542, 522742, 742946, 617340, 605948,~
#> $ organization_org_name <chr> "UNIVERSITY OF NEW HAVEN", "YALE UNIVERSITY", "YALE UN~
#> $ organization_city <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_country <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_org_city <chr> "WEST HAVEN", "NEW HAVEN", "NEW HAVEN", "NEW HAVEN", "~
#> $ organization_org_country <chr> "UNITED STATES", "UNITED STATES", "UNITED STATES", "UN~
#> $ organization_org_state <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", ~
#> $ organization_org_state_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_dept_type <chr> "BIOLOGY", "PATHOLOGY", "NEUROLOGY", "INTERNAL MEDICIN~
#> $ organization_fips_country_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_org_duns <chr> "067075382", "043207562", "043207562", "043207562", "0~
#> $ organization_org_ueis <chr> "FZBDVM1MBTN9", "FL6GV84CKN57", "FL6GV84CKN57", "FL6GV~
#> $ organization_primary_duns <chr> "067075382", "043207562", "043207562", "043207562", "0~
#> $ organization_primary_uei <chr> "FZBDVM1MBTN9", "FL6GV84CKN57", "FL6GV84CKN57", "FL6GV~
#> $ organization_org_fips <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", ~
#> $ organization_org_ipf_code <chr> "207401", "9420201", "9420201", "9420201", "9420201", ~
#> $ organization_org_zipcode <chr> "065161916", "065208327", "065208327", "065208327", "0~
#> $ organization_external_org_id <int> 207401, 9420201, 9420201, 9420201, 9420201, 9420201, 9~
Most users will prefer the flattened format above. It looks like Yale is busy, but it is not the only org matching our search.
res %>%
group_by(organization_org_name) %>%
summarise(project_count = n())
#> # A tibble: 2 x 2
#> organization_org_name project_count
#> <chr> <int>
#> 1 UNIVERSITY OF NEW HAVEN 1
#> 2 YALE UNIVERSITY 499
The org_names_exact_match
criteria can be used as an
alternative when we know the exact org name as it appears in RePORTER,
if we want only that org’s projects returned.
We can also filter projects by the geographic location (country/state/city) of the applicant organization.
## A valid request but probably not what we want
req <- make_req(criteria =
list(
fiscal_years = 2010:2011,
include_active_projects = TRUE,
org_cities = "New Haven",
org_states = "WY"
),
include_fields = c("Organization", "FiscalYear", "AwardAmount"),
message = FALSE ## suppress printed message
)
res <- get_nih_data(req,
max_pages = 5,
flatten_result = TRUE)
#> Retrieving first page of results (up to 500 records)
#> Done - 0 records returned. Try a different search criteria.
Multiple criteria are usually connected by logical “AND” - there are no orgs based in the city of New Haven in Wyoming state (because it doesn’t exist.)
req <- make_req(criteria =
list(
fiscal_years = 2015,
include_active_projects = TRUE,
org_states = "WY"
),
include_fields = c("ApplId", "Organization", "FiscalYear", "AwardAmount"),
sort_field = "AwardAmount",
sort_order = "desc",
message = FALSE)
res <- get_nih_data(req,
flatten_result = TRUE)
#> Retrieving first page of results (up to 500 records)
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 87
#> Columns: 20
#> $ appl_id <int> 8884461, 8898483, 10845672, 10883783, 10563569, 108456~
#> $ fiscal_year <int> 2015, 2015, 2024, 2024, 2022, 2024, 2024, 2015, 2015, ~
#> $ award_amount <int> 4957554, 3521553, 3418046, 2480469, 2255378, 1668382, ~
#> $ organization_org_name <chr> "WYOMING STATE DEPARTMENT OF HEALTH", "UNIVERSITY OF W~
#> $ organization_city <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_country <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_org_city <chr> "CHEYENNE", "LARAMIE", "LARAMIE", "LARAMIE", "LARAMIE"~
#> $ organization_org_country <chr> "UNITED STATES", "UNITED STATES", "UNITED STATES", "UN~
#> $ organization_org_state <chr> "WY", "WY", "WY", "WY", "WY", "WY", "WY", "WY", "WY", ~
#> $ organization_org_state_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_dept_type <chr> NA, "PHARMACOLOGY", "PHARMACOLOGY", "ZOOLOGY", "ZOOLOG~
#> $ organization_fips_country_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
#> $ organization_org_duns <chr> "809915796", "069690956", "069690956", "069690956", "0~
#> $ organization_org_ueis <chr> "JP1QRJYYJG73", "FDR5YF2K32X5", "FDR5YF2K32X5", "FDR5Y~
#> $ organization_primary_duns <chr> "809915796", "069690956", "069690956", "069690956", "0~
#> $ organization_primary_uei <chr> "JP1QRJYYJG73", "FDR5YF2K32X5", "FDR5YF2K32X5", "FDR5Y~
#> $ organization_org_fips <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", ~
#> $ organization_org_ipf_code <chr> "9408801", "9412601", "9412601", "9412601", "9412601",~
#> $ organization_org_zipcode <chr> "820020001", "820712000", "820712000", "820712000", "8~
#> $ organization_external_org_id <int> 9408801, 9412601, 9412601, 9412601, 9412601, 9412601, ~
Why are there projects from more recent years than 2015? Because the
include_active_projects
flag adds in active projects that
match all criteria aside from fiscal_years
(this appears to
be the intended behavior by RePORTER).
Other criteria provide search and filtering capability on many of the nested data elements. These criteria are passed as lists and must include a value for each of the named elements within.
The pi_names
and po_names
criteria allow
the user to search for projects based on the first and last names of
Principal Investigators and Principal Officers assigned. Each of these
criteria must be provided as a list with three named character vector
elements: first_name
, last_name
, and
any_name
. Even if you only want to search on one of
these name fields, you must provide the remaining elements as an empty
string. We will demonstrate with a search on PI name:
## projects funded in 2021 where the principal investigator first name
## is "Michael" or begins with "Jo"
req <- make_req(criteria =
list(fiscal_years = 2021,
pi_names = list(first_name = c("Michael", "Jo*"),
last_name = c(""), # must specify all pi_names elements always
any_name = character(1))),
include_fields = nih_fields %>%
filter(payload_name == "pi_names") %>%
pull(include_name),
message = FALSE)
res <- get_nih_data(req,
max_pages = 1,
flatten_result = TRUE)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 13 pages. Only partial results will be retrieved.
res %>% glimpse(width = getOption("cli.width"))
#> Rows: 500
#> Columns: 2
#> $ principal_investigators <list> [<data.frame[3 x 7]>], [<data.frame[2 x 7]>], [<data.frame[3~
#> $ contact_pi_name <chr> "LILE, JOSHUA ANTHONY", "ROUT, MICHAEL P", "BATTAGLIA, CATHER~
Here we searched for any projects with a PI first-named “Michael” or beginning with “Jo” - the “*” is a wildcard operator.
Note that the first column in the return is a list of data frames of
variable height (not a nested data.frame
) - we
leave such returned elements to the user to handle extraction/formatting
- flattening is only performed for lists of atomic vectors and nested
data frames.
RePORTER allows users to search the project title, abstract, and tags
for specific terms or phrases. You can access this capability with the
advanced_text_search
criteria - a named list with three
elements:
operator
may be either “and”, “or”, or “advanced” -
and/or will specify the logical operator connecting multiple search
terms. “advanced” allows the user to pass a boolean search string
directly;search_field
can be any or multiple of “terms”,
“abstract”, “projecttitle.” To search all items, specify “all” or “” (a
length 1 vector with an empty string);search_text
may be either a length 1 character vector
of space-delimited search terms (when using “and” or “or” for the
operator argument - the logical operator is inserted between all search
terms); or it may be a boolean search string (when specifying “advanced”
for the operator argument).## using advanced_text_search with boolean search string
req <- make_req(criteria =
list(advanced_text_search =
list(operator = "advanced",
search_field = c("terms", "abstract"),
search_text = "(head AND trauma) OR \"brain damage\" AND NOT \"psychological\"")),
include_fields = c("ProjectTitle", "AbstractText", "Terms") )
#> This is your JSON payload:
#> {
#> "criteria": {
#> "advanced_text_search": {
#> "operator": "advanced",
#> "search_field": "terms,abstract",
#> "search_text": "(head AND trauma) OR \"brain damage\" AND NOT \"psychological\""
#> },
#> "use_relevance": false,
#> "include_active_projects": false,
#> "exclude_subprojects": false,
#> "multi_pi_only": false,
#> "newly_added_projects_only": false,
#> "sub_project_only": false
#> },
#> "include_fields": [
#> "ProjectTitle",
#> "AbstractText",
#> "Terms"
#> ],
#> "offset": 0,
#> "limit": 500
#> }
#>
#> If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
#> https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search
res <- get_nih_data(req, max_pages = 1)
#> Retrieving first page of results (up to 500 records)
#> max_pages set to 1 by user. Result set contains 44 pages. Only partial results will be retrieved.
Let’s inspect the fields we searched from one of these results:
one_rec <- res %>%
slice(42) %>%
mutate(abstract_text = gsub("[\r\n]", " ", abstract_text))
one_rec %>% pull(project_title) %>% print
#> [1] "The neuropathology of mild traumatic brain injury in Alzheimer's disease"
The RePORTER API provides no direct way to obtain complete result
sets when searches yield over 10,000 records.
get_nih_data()
provides the return_meta
argument which is defaulted to FALSE
. When set to
TRUE
and combined with a little programming, you can easily
obtain full result sets well beyond the 10K limit. One approach may be
the following:
get_nih_data()
with
max_pages = 1
(or some small number of pages); also set
return_meta = TRUE
in order to determine the total number
of records in the full result setaward_amount
)award_amount_range
criteriaBelow is an implementation of the above logic:
all_res <- list()
for(y in 2017:2021) { ## five years to loop over, each year is ~80K records
## We only need the AwardAmount for quantiles
req_sample <- make_req(criteria = list(fiscal_years = y),
include_fields = "AwardAmount")
## get a sample of the result set - 1000 records should be enough
## return the metadata
res_sample <- get_nih_data(req_sample, max_pages = 2, return_meta = TRUE)
paste0("There are ", res_sample$meta$total, " results for fiscal year ", y) %>%
print()
## deciles of award amount - each decile should contain ~7,314.2 records, approximately
qtiles <- res_sample$records %>% pull(award_amount) %>% quantile(na.rm = TRUE, probs = seq(.1, 1, .1))
## list for qtile results (full year)
this_res <- list()
## for each qtile
for (i in 1:length(qtiles)) {
if (i == 1) {
award_min <- 0
} else {
award_min <- ceiling(qtiles[i-1])+.01
}
if (i == length(qtiles)) {
award_max <- 1e9 ## arbitrarily huge
} else {
award_max <- ceiling(qtiles[i])
}
req <- make_req(criteria = list(fiscal_years = y,
award_amount_range = list(min_amount = award_min,
max_amount = award_max)))
## result set for quantile
this_res[[i]] <- get_nih_data(req, flatten_result = FALSE)
}
## list of result sets for each year
yr_res[[y %>% as.character()]] <- this_res
}
## shape it up
all_res <- unlist(yr_res, recursive = FALSE) %>%
bind_rows() %>%
flatten(recursive = FALSE) %>%
clean_names()
## pull out everything that is flat
flat_columns <- all_res %>%
select_if(is.atomic)
## everything that isn't
annoying_columns <- all_res %>%
select_if(!is.atomic)
Note that using award_amount
for this purpose will omit
records with missing values. If you need these included, you may
consider similar logic applied to an alternative field such as
award_notice_date
.
The version number of R and packages loaded for generating the vignette were:
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
#> [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tufte_0.13 scales_1.3.0 dplyr_1.1.4 ggrepel_0.9.6
#> [5] ggplot2_3.5.1 repoRter.nih_0.1.4 tibble_3.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.4 digest_0.6.37
#> [6] magrittr_2.0.3 evaluate_1.0.1 grid_4.4.1 timechange_0.3.0 fastmap_1.2.0
#> [11] jsonlite_1.8.9 httr_1.4.7 purrr_1.0.2 fansi_1.0.6 jquerylib_0.1.4
#> [16] cli_3.6.3 rlang_1.1.4 crayon_1.5.3 munsell_0.5.1 withr_3.0.2
#> [21] cachem_1.1.0 yaml_2.3.10 tools_4.4.1 colorspace_2.1-1 curl_5.2.3
#> [26] assertthat_0.2.1 buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
#> [31] lubridate_1.9.3 snakecase_0.11.1 stringr_1.5.1 janitor_2.2.0 pkgconfig_2.0.3
#> [36] pillar_1.9.0 bslib_0.8.0 gtable_0.3.6 Rcpp_1.0.13 glue_1.8.0
#> [41] xfun_0.49 tidyselect_1.2.1 highr_0.11 sys_3.4.3 knitr_1.48
#> [46] htmltools_0.5.8.1 rmarkdown_2.28 maketools_1.3.1 compiler_4.4.1