The following content addresses the preparation of datasets and associated documentation from NHLBI-funded clinical research studies for submission to NHLBI repositories, such as NHLBI BioData Catalyst® (BDC), including datasets to be submitted with corresponding biospecimen collections. Data submitters should also review the Instructions for Data Submission to BDC.
These instructions will be updated to support new scientific advancements and as new information and processes are made available.
Scientific Data is defined as data commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings, regardless of whether the data are used to support scholarly publications.
- Scientific data includes any data needed to validate and replicate research findings.
- Scientific data does not include laboratory notebooks, preliminary analyses, completed case report forms, drafts of scientific papers, plans for future research, peer reviews, communications with colleagues, or physical objects such as laboratory specimens.
For NHLBI clinical studies, data may also include information collected and recorded from study participants through periodic examinations; measurements from biospecimens; quantitative results from procedures such as imaging studies, exercise tests, and lung function assessments; and clinical event surveillance and follow-up contacts.
Study documentation is defined as descriptive information regarding the conduct of the study and collection of data. Study documentation may include study protocols, manual of operations or manual of procedures, annotated data collection forms, codebooks or data dictionaries, algorithms for calculated or derived data elements, and descriptions of data derived from procedures or biospecimens. This includes additional information intended to make scientific data interpretable and reusable (e.g., date, independent sample and variable construction and description, methodology, data provenance, data transformations, any intermediate or descriptive observational variables).
Overview of Responsibilities in Preparing Datasets for Submission to NHLBI
Investigators conducting NHLBI studies must comply with the data management and sharing policies and supplements effective at the time funding applications were submitted, and those included in the terms and conditions of their awards.
- For applications submitted prior to January 25, 2023, refer to the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies.
- For applications submitted between January 25, 2023, and May 24, 2023, refer to the NIH Data Management and Sharing Policy.
- For applications submitted on or after May 25, 2023, refer to the NHLBI Supplement to the NIH Policy for Data Management and Data Sharing.
Investigator responsibilities include taking measures to reduce the likelihood that any individual participant can be identified, such as the elimination of personal identifiers. These measures safeguard privacy and honor the informed consent of research participants. In addition, if tiered consent was utilized by the study, investigators must provide the NHLBI with a list of participant identification numbers with data fields aligning with their consent group indicating:
- Participants who asked that their data not be shared beyond the initial study investigators (if applicable)
- Participants who asked that their data not be used for commercial purposes (if applicable)
- Participants who asked that use of their data be restricted to specific types of research activities (if applicable)
If possible, alignment with existing NIH Data Use Limitations (DULs) is encouraged.
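As an illustration of the consent list described above, the following minimal sketch writes one row per participant with a Yes/No field for each restriction. The column names, the sample participant IDs, and the Yes/No coding are assumptions made for this example; they are not an NHLBI-specified layout.

```python
import csv
import io

# Hypothetical column names; the instructions do not prescribe a layout.
CONSENT_FIELDS = [
    "participant_id",
    "no_sharing_beyond_study",   # data not to be shared beyond the initial investigators
    "no_commercial_use",         # data not to be used for commercial purposes
    "restricted_research_use",   # use limited to specific types of research
]


def consent_rows_to_csv(rows):
    """Return CSV text with one row per participant and Yes/No consent flags."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CONSENT_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = {"participant_id": row["participant_id"]}
        for field in CONSENT_FIELDS[1:]:
            # A restriction that is absent from the input is recorded as "No".
            record[field] = "Yes" if row.get(field, False) else "No"
        writer.writerow(record)
    return buffer.getvalue()


# Made-up example records.
example = [
    {"participant_id": "P0001", "no_commercial_use": True},
    {"participant_id": "P0002", "restricted_research_use": True},
]
consent_csv = consent_rows_to_csv(example)
print(consent_csv)
```

Where a study's consent groups correspond to existing NIH Data Use Limitations, the flag columns could be replaced by a single column holding the relevant DUL.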
Additional requirements include the provision of documentation and key study documents (protocol, data collection forms, manuals of procedures, etc.) that will enable the use of prepared datasets by outside investigators.
Datasets and associated documentation must be provided in machine-readable format to the NHLBI. Acceptable formats for submission of data include American Standard Code for Information Interchange (ASCII) text with comma or tab delimiters or other machine-readable formats. See more information on format requirements below.
Investigators conducting ancillary studies based on ongoing (parent) studies with funding applications submitted on or after May 25, 2023, are subject to the NHLBI Supplement to the NIH Policy for Data Management and Sharing. Ancillary studies with applications submitted between January 25, 2023, and May 24, 2023, are subject to the NIH Data Management and Sharing Policy. Ancillary studies with applications submitted before January 25, 2023, that are subject to the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies must also submit ancillary study data to the NHLBI through the parent study coordinating center or data submission process established by the parent study. For these studies and any other longitudinal cohort study, a linkage file to map the same participant across all shared study components (i.e., biospecimens, image files, physiological signal files) is required for submission by the parent study coordinating center (see the guidance on replacing participant identification numbers under De-Identifying Clinical Datasets below).
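One possible shape for such a linkage file is sketched below: one row per participant per shared item, so that the same participant can be followed across components. The column names, component names, and item identifiers are hypothetical and serve only to illustrate the idea.

```python
import csv
import io


def build_linkage_rows(components):
    """Flatten {component: {participant_id: [item_id, ...]}} into sorted rows."""
    rows = []
    for component, by_participant in components.items():
        for participant_id, item_ids in by_participant.items():
            for item_id in item_ids:
                rows.append((participant_id, component, item_id))
    return sorted(rows)


def linkage_csv(components):
    """Return the linkage table as comma-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["participant_id", "component", "item_id"])
    writer.writerows(build_linkage_rows(components))
    return buffer.getvalue()


# Made-up identifiers for two study components.
example_components = {
    "biospecimen": {"P0001": ["BS-17"], "P0002": ["BS-18"]},
    "image": {"P0001": ["IMG-3", "IMG-4"]},
}
linkage_text = linkage_csv(example_components)
print(linkage_text)
```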
Types of Clinical Data to be Included in Datasets Submitted to the NHLBI
- Clinical Trials – datasets should include baseline, interim visit(s), ancillary data, procedural-based data, and outcome data, along with laboratory measurements not otherwise summarized.
- Observational Epidemiology Studies – datasets should include all of the examination data obtained in each examination cycle, ancillary data, and/or all of the follow-up information available up to the last follow-up cycle cutoff date.
Data from scored or procedural assessments (e.g., food item data, psycho-social questionnaires, individual electrocardiographic lead scores, etc.) should be included for each participant with both raw data elements and summary information.
De-Identifying Clinical Datasets for Submission to NHLBI
Data must be provided in a manner that protects the privacy of study participants. Steps taken to protect participant privacy in preparing a dataset must be documented and submitted (e.g., age Winsorizing, date shifting, etc.). Data submitters are encouraged to provide the BDC-Data Management Core (DMC) with a summary of all proposed modifications and deletions to be made to a dataset prior to their implementation.
The following is a framework for decision-making regarding preparation of datasets:
- Participant identifiers (IDs):
  - Remove/recode obvious identifiers and those considered to be protected health information (PHI)/personally identifiable information (PII) (e.g., name, addresses, social security numbers, place of birth, city of birth, contact data). For geospatial data and analysis, some location detail may be required to be preserved. Please contact the BDC-Data Management Core regarding BDC data protections for geospatial data and the submission process.
  - Replace original identification numbers with new, randomized identification numbers so that neither the identification number nor the order of records within the data reveals a participant's identity, while retaining linkage information to support multi-modal data linkages, especially for longitudinal cohort studies with multiple data modalities (e.g., clinical, omics, imaging, sleep). IDs/codes linking the new and original participant identifiers should be sent to the NHLBI in a separate file, along with data fields indicating relevant consent and/or data use limitations (e.g., commercial use restriction Yes/No), so that linkage may be made if allowed and necessary for future research.
- Dates: All dates should be coded relative to a specific reference point (e.g., date of randomization or study entry), with documentation of the date obfuscation method. This provides privacy protection for individuals known to be in a study and to have had some significant event (e.g., a myocardial infarction) on a particular date.
- Variables that are administrative, sensitive in nature, or related to centers in multicenter studies:
  - Clinical center identifier – Do not include center identifiers in trials or studies that have relatively few participants per center (e.g., fewer than 10 or approximately 5%), as this could facilitate the identification of participants. In trials that have either many centers or a large number of participants per center, the identifier may offer little possibility of identifying individuals, and investigators and the NHLBI will determine on a case-by-case basis whether to include them.
  - Do not include interviewer or technician identification numbers, batch numbers, or other administrative data, as this could facilitate the identification of participants.
  - If it is not directly relevant to the original aims of the study, obfuscate or remove sensitive data, including incarcerations, illicit drug use, mental illness, risky behaviors (e.g., carrying a gun or exhibiting violent behavior), sexual attitudes or behaviors, and selected medical conditions (e.g., alcohol use disorders, HIV/AIDS).
  - Delete regional variables with little or no variation within a study center because they could be used to identify that center.
  - Unedited, verbatim responses that are stored as text data (e.g., specified in the "other" category) should be edited to remove any embedded dates, names, or geographic identifiers (hospital names, city names, etc.).
- Group or recode variables with low frequencies for some values that might be used to identify participants (traits with visual or casual knowledge component). These might include:
  - Socioeconomic and demographic data (e.g., marital status, occupation, income, education, language, number of years married).
  - Household and family composition (e.g., number in household, number of siblings or children, ages of children or step-children, number of brothers and sisters, relationships, spouse in study).
  - Number of pregnancies, births, or multiple children within a birth.
  - Physical characteristics (e.g., missing limbs, blindness).
  - Detailed medication, hospitalization, and cause of death codes, especially those related to sensitive medical conditions as listed above, such as HIV/AIDS or psychiatric disorders.
  - Prior medical conditions with low frequency (e.g., group specific cancers into broader categories) and related questions such as age at diagnosis and current status.
  - Parent and sibling medical history (e.g., parents' ages at death).
  - Race/ethnicity information when very few participants are in certain groups or cells.
- Data elements with no visual or casual knowledge component or that cannot be linked to existing databases should not be modified. For those data elements that do require modification, suggested approaches include:
  - Polychotomous variables: values or groups should be collapsed so as to ensure a minimum number of participants (at least 10 or approximately 5%, whichever is less) for each value within a categorical cell.
  - Dichotomous variables: data may either be grouped with other related variables so as to ensure a minimum number of participants (at least 10 or approximately 5%) in a specific cell or deleted.
- If investigators think other variables may also facilitate the identification of participants, they should consult the BDC-Data Management Core about recoding/removing such elements.
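Three of the steps in the framework above lend themselves to a short illustration: assigning randomized identification numbers, coding dates relative to a reference point, and collapsing low-frequency categories. The sketch below is one possible approach, not a prescribed method. The function names, the 8-digit ID format, the "Other" label, and the reading of "at least 10 or approximately 5%, whichever is less" as the smaller of 10 and 5% of participants (rounded up) are all assumptions made for this example.

```python
import math
import secrets
from collections import Counter
from datetime import date


def assign_random_ids(original_ids, digits=8):
    """Map each original ID to a unique random ID unrelated to its value or order."""
    rng = secrets.SystemRandom()
    mapping, used = {}, set()
    for original in original_ids:
        new_id = f"{rng.randrange(10 ** digits):0{digits}d}"
        while new_id in used:  # redraw on the rare collision
            new_id = f"{rng.randrange(10 ** digits):0{digits}d}"
        used.add(new_id)
        mapping[original] = new_id
    return mapping


def days_from_reference(event_date, reference_date):
    """Code a date as the number of days since a reference date (e.g., randomization)."""
    return (event_date - reference_date).days


def collapse_small_cells(values, min_count=10, min_fraction=0.05, other_label="Other"):
    """Replace categories below the assumed threshold with a single combined label."""
    counts = Counter(values)
    threshold = min(min_count, math.ceil(min_fraction * len(values)))
    return [v if counts[v] >= threshold else other_label for v in values]


# Made-up example data.
id_map = assign_random_ids(["1001", "1002", "1003"])
study_day = days_from_reference(date(2021, 3, 15), date(2021, 1, 1))
collapsed = Counter(collapse_small_cells(["A"] * 50 + ["B"] * 30 + ["C"] * 3 + ["D"] * 2))
print(study_day, dict(collapsed))
```

In this sketch, `id_map` corresponds to the separate linkage file sent to the NHLBI, and the shared dataset would be re-sorted by the new ID so that record order no longer reflects the original numbering. The combined category can itself remain below the threshold, in which case further grouping or deletion may still be needed.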
Documentation for datasets must be comprehensive and sufficiently clear to enable investigators who are not familiar with a dataset to use it. The documentation must include data collection forms, study procedures and protocols, data dictionaries, algorithms for calculated data elements, descriptions of all variable recoding performed, and a list of all study publications to date.
Data dictionaries should utilize NIH Common Data Elements (CDEs) or other relevant vocabularies/ontologies (e.g., Clinical Data Interchange Standards Consortium (CDISC) for Clinical Trials or the Observational Medical Outcomes Partnership (OMOP) for clinically-derived data). The submission should comply with the data formats outlined in the Data Management and Sharing Plan (e.g., data type, standards, etc.) agreed to with the Program Official at the start of the study. Data Dictionaries should also be in a machine-readable format such as .csv or .json file format. PDF scans of Data Dictionaries are not acceptable.
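As a rough illustration of a machine-readable data dictionary, the sketch below stores entries as JSON and checks that each entry carries a minimal set of fields. The field names (`name`, `label`, `type`, `units`, `values`) and the example variables are hypothetical; they are not drawn from NIH CDEs, CDISC, or OMOP, and a real submission would follow the standards agreed to in the Data Management and Sharing Plan.

```python
import json

# Hypothetical entries; field names are illustrative, not a required schema.
data_dictionary = [
    {"name": "age_years", "label": "Age at baseline", "type": "integer", "units": "years"},
    {"name": "sex", "label": "Sex", "type": "encoded", "values": {"1": "Male", "2": "Female"}},
]

REQUIRED_KEYS = ("name", "label", "type")


def missing_keys(entry):
    """Return the required keys that are absent or empty in a dictionary entry."""
    return [key for key in REQUIRED_KEYS if not entry.get(key)]


# Refuse to serialize a dictionary with incomplete entries.
problems = {e.get("name", "?"): missing_keys(e) for e in data_dictionary if missing_keys(e)}
if problems:
    raise ValueError(f"Incomplete data dictionary entries: {problems}")

dictionary_json = json.dumps(data_dictionary, indent=2)
print(dictionary_json)
```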
In addition, a summary documentation file is required. This file must provide a complete overview of the data and a description of their use for investigators who are not familiar with the dataset. It must also contain a brief description of the study (including a general orientation to the study, its components, and its examination and follow-up timeline), a listing of all files being provided, and, if applicable, generation program code for machine-readable data files or formats (e.g., .csv or tab-delimited).
Selected study documentation will be used to describe the study to promote Findable, Accessible, Interoperable, and Reusable (FAIR) data sharing principles, for example, on an external facing NHLBI website. Examples include Forms, Descriptive Statistics, and the Study Protocol. Study documents must be submitted in a format that is 508 compliant so that they are accessible to people with disabilities. The DHHS maintains a website devoted to 508 issues with links to resources on creating and checking accessibility at http://www.hhs.gov/web/508/index.html.
Format, Storage, and Delivery of Study Data and Materials
The required format for the data is .csv or another machine-readable format (e.g., tab-delimited). If submitting Statistical Analysis System (SAS) files, a converted version of the data must also accompany the submission in either .csv or tab-delimited format for machine readability. Each data element must be provided with a descriptive label and should preferably align with existing, relevant standard vocabularies and ontologies.
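The sketch below shows one way to write the same records in both comma-delimited and tab-delimited form using Python's standard `csv` module. The variable names and values are made up, and the function `to_delimited` is a hypothetical helper; reading the source data from SAS files would require a separate tool and is not shown.

```python
import csv
import io


def to_delimited(records, fieldnames, delimiter=","):
    """Return the records as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records; note the comma inside the free-text value.
fields = ["participant_id", "sbp_mmhg", "medication_note"]
records = [
    {"participant_id": "P0001", "sbp_mmhg": 120, "medication_note": "aspirin, daily"},
    {"participant_id": "P0002", "sbp_mmhg": 135, "medication_note": "none"},
]

csv_text = to_delimited(records, fields)           # comma-delimited
tsv_text = to_delimited(records, fields, "\t")     # tab-delimited
print(csv_text)
print(tsv_text)
```

The writer quotes a value only when it contains the active delimiter, so the free-text value is quoted in the comma-delimited output and left bare in the tab-delimited output.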
The data dictionary or codebook must be in a machine-readable format. All other comprehensive documentation and the summary documentation must be prepared in a consistent format, as Microsoft Word, ASCII, or portable document format (PDF) files, and included on the same storage medium as the dataset. To ensure access by users with disabilities, all PDF files must be created in Adobe Acrobat version 5.0 or higher. Documentation that is not available in electronic form, such as data collection forms, should be scanned into a graphics file, converted to a PDF file using Adobe Acrobat version 5.0 or higher, and saved on the same medium as the dataset. Pedigree data should be provided in a format readable by standard genetic analysis programs such as SAGE and SOLAR, with one individual's data per line beginning with pedigree identifier, individual's ID, father's ID, mother's ID, and individual's sex.
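The pedigree line layout described above can be sketched as follows. The field order comes from the text; the space delimiter, the use of "0" for an unknown parent, and the 1/2 sex coding are assumptions for illustration and should be checked against the requirements of the target analysis program.

```python
def pedigree_line(pedigree_id, individual_id, father_id, mother_id, sex, extra=()):
    """Build one line: pedigree ID, individual ID, father ID, mother ID, sex, then other data."""
    fields = [
        pedigree_id,
        individual_id,
        father_id or "0",   # assumed placeholder for an unknown father
        mother_id or "0",   # assumed placeholder for an unknown mother
        sex,
        *extra,
    ]
    return " ".join(str(field) for field in fields)


# Made-up family: two founders and one child.
family = [
    ("F1", "1", None, None, "1"),
    ("F1", "2", None, None, "2"),
    ("F1", "3", "1", "2", "2"),
]
pedigree_text = "\n".join(pedigree_line(*member) for member in family)
print(pedigree_text)
```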
Data and study documentation materials, as well as updates to data when additional data are added or errors are found, should be sent to the NHLBI within the timelines described in relevant policies and policy supplements. For studies with funding applications submitted prior to January 25, 2023, refer to the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies. For studies with funding applications submitted between January 25, 2023, and May 24, 2023, refer to the NIH Data Management and Sharing Policy. For studies with funding applications submitted on or after May 25, 2023, refer to the NHLBI Supplement to the NIH Policy for Data Management and Data Sharing.
The following links highlight NIH policy and related guidance on sharing of research data developed with NIH funding.