Guidelines for Preparing Clinical Study Data Sets for Submission to the NHLBI Data Repository


This document addresses the preparation of data sets and associated documentation from NHLBI-funded clinical studies for deposition into the NHLBI Data Repository.


Data – Information collected and recorded from study participants through periodic examinations; measurements from biospecimens; quantitative results from procedures such as imaging studies, exercise tests, and lung function assessments; and clinical event surveillance and follow-up contacts.

Study documentation – Descriptive information regarding the conduct of the study and collection of Data. Study documentation may include the study protocol, manual of operations or manual of procedures, annotated data collection forms, codebooks or data dictionaries, algorithms for calculated or derived data elements, and descriptions of data derived from procedures or biospecimens.



Overview of Responsibilities in Preparing Data Sets for the NHLBI Data Repository

Investigators conducting NHLBI studies subject to the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies may be required as part of the terms and conditions of their awards to prepare and deliver to the NHLBI data sets that satisfy NHLBI requirements. This includes measures to reduce the likelihood that any individual participant can be identified, such as the elimination of personal identifiers. These measures safeguard privacy and honor the informed consent of research participants. Additional requirements include the provision of documentation and key study documents (protocol, data collection forms, manuals of procedures, etc.) that will enable the use of prepared data sets by outside investigators.

Data sets and associated documentation must be provided to the Institute in electronic form. In addition, if the study used tiered consent, investigators must provide the Institute with a list of participant identification numbers with data fields indicating:

  • Participants who asked that their data not be shared beyond the initial study investigators (if applicable)
  • Participants who asked that their data not be used for commercial purposes (if applicable)
  • Participants who asked that use of their data be restricted to specific types of research activities (if applicable)
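As an illustration only, the list described above could be assembled as a simple delimited file with one row per participant. The column names below are hypothetical and are not prescribed by the NHLBI. A Yes/No flag cannot capture which types of research are permitted, so a study with such restrictions would likely need an additional descriptive field.

```python
import csv
import io

# Hypothetical column names; the guidelines only call for an identification
# number plus data fields indicating each consent restriction.
FIELDS = [
    "participant_id",
    "no_sharing_beyond_study",
    "no_commercial_use",
    "restricted_research_use",
]


def write_consent_list(rows, out):
    """Write one row per participant with Yes/No consent-restriction flags."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        record = {"participant_id": row["participant_id"]}
        for flag in FIELDS[1:]:
            # A restriction that was not recorded is written as "No".
            record[flag] = "Yes" if row.get(flag, False) else "No"
        writer.writerow(record)


# Made-up example participants.
example = [
    {"participant_id": "A001", "no_commercial_use": True},
    {"participant_id": "A002"},
]
buffer = io.StringIO()
write_consent_list(example, buffer)
consent_csv = buffer.getvalue()
```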

Investigators conducting ancillary studies based on ongoing (parent) studies that are subject to the NHLBI Data Sharing Policy must also submit ancillary study data to the NHLBI through the parent study coordinating center or data submission process established by the parent study.

Types of Data to be Included in Data Sets Submitted to the NHLBI Data Repository

  • Clinical Trials – data sets should include baseline, interim visit, ancillary, procedure-based, and outcome data, along with laboratory measurements not otherwise summarized.
  • Observational Epidemiology Studies – data sets should include all of the examination data obtained in each examination cycle, ancillary data, and/or all of the follow-up information available up to the last follow-up cycle cutoff date.

Data from scored or procedural assessments (e.g., food item data, psychosocial questionnaires, individual electrocardiographic lead scores) should include for each participant both raw data elements and summary information where feasible.

Guidelines for Redaction of Data Sets for Submission to the NHLBI Data Repository

Data must be provided in a manner that protects the privacy of study participants. Steps taken to protect participant privacy in preparing a data set must be documented. A summary of all proposed modifications and deletions to be made to a data set should be submitted to and approved by the NHLBI Data Repository representative prior to their implementation.

The following guidelines provide a framework for decision-making regarding preparation of data sets:

  1. Participant identifiers:
    • Delete obvious identifiers (e.g., name, addresses, social security numbers, place of birth, city of birth, contact data).
    • Replace original identification numbers with new, randomized identification numbers to sever linkage with existing data both in terms of the identification number and order within the data. Codes linking the new and original participant identifiers should be sent to the NHLBI in a separate file along with data fields indicating relevant consent restrictions (e.g., commercial use restriction Yes/No), so that linkage may be made if allowed and necessary for future research.
  2. Dates: All dates should be coded relative to a specific reference point (e.g., date of randomization or study entry). This provides privacy protection for individuals known to be in a study and to have had some significant event (e.g., a myocardial infarction) on a particular date.
  3. Variables that are administrative, sensitive in nature, or related to centers in multicenter studies:
    • Clinical center identifier – Do not include center identifiers in trials or studies that have only a few centers and/or relatively few participants per center, as this could facilitate identification of participants. In trials that have either many centers or a large number of participants per center, the identifier may offer little possibility of identifying individuals, and investigators and the NHLBI will determine on a case-by-case basis whether to include it.
    • Delete interviewer or technician identification numbers, batch numbers, or other administrative data, as these could facilitate identification of participants.
    • If it is not directly relevant to the original aims of the study, delete sensitive data, including incarcerations, illicit drug use, mental illness, risky behaviors (e.g., carrying a gun or exhibiting violent behavior), sexual attitudes or behaviors, and selected medical conditions (e.g., alcohol use disorders, HIV/AIDS).
    • Delete regional variables with little or no variation within a center, because they could be used to identify that center.
  4. Unedited, verbatim responses that are stored as text data (e.g., specified in "other" category) should be deleted or edited to remove any embedded dates, names, or geographic identifiers (hospital names, city name, etc.).
  5. Group or recode variables with low frequencies for some values that might be used to identify participants (traits with visual or casual knowledge component). These might include:
    • Socioeconomic and demographic data (e.g., marital status, occupation, income, education, language, number of years married).
    • Household and family composition (e.g., number in household, number of siblings or children, ages of children or step-children, number of brothers and sisters, relationships, spouse in study).
    • Number of pregnancies, births, or multiple children within a birth.
    • Anthropometric measures (e.g., height, weight, waist girth, hip girth, body mass index).
    • Physical characteristics (e.g., missing limbs, blindness).
    • Detailed medication, hospitalization, and cause of death codes, especially those related to sensitive medical conditions as listed above, such as HIV/AIDS or psychiatric disorders.
    • Prior medical conditions with low frequency (e.g., group specific cancers into broader categories) and related questions such as age at diagnosis and current status.
    • Parent and sibling medical history (e.g., parents' ages at death).
    • Race/ethnicity information when very few participants are in certain groups or cells.
  6. Data elements with no visual or casual knowledge component or that cannot be linked to existing databases should not be modified. For those data elements that do require modification, suggested approaches include:
    • Polychotomous variables: values or groups should be collapsed so as to ensure a minimum number of participants (e.g., at least 15-20 or approximately 5%, whichever is less) for each value within a categorical cell.
    • Dichotomous variables: data may be either grouped with other related variables, so as to ensure a minimum number of participants (e.g., at least 15-20 or approximately 5%) in a specific cell, or deleted.
    • If investigators think other variables may also facilitate identification of participants, they should consult the NHLBI Data Repository representative about recoding/removing such elements.
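
Three of the steps in the guidelines above (randomized identification numbers, dates coded relative to a reference point, and collapsing of low-frequency categories) can be sketched as follows. This is a minimal illustration under stated assumptions, not an NHLBI-endorsed procedure: the function names, the "P" identifier prefix, the "Other" label, and the reading of "whichever is less" as the smaller of a fixed count and a 5% share are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter
from datetime import date


def assign_random_ids(original_ids, rng):
    """Map each original ID to a new ID assigned in random order."""
    numbers = list(range(1, len(original_ids) + 1))
    rng.shuffle(numbers)
    # The "P" prefix and zero padding are illustrative assumptions.
    return {orig: f"P{num:06d}" for orig, num in zip(original_ids, numbers)}


def days_from_reference(event_date, reference_date):
    """Express a calendar date as days since the participant's reference date."""
    return (event_date - reference_date).days


def collapse_small_cells(values, min_count=15, min_fraction=0.05, other_label="Other"):
    """Recode categories below the minimum cell size to a single catch-all label.

    Assumption: the threshold is the smaller of min_count and min_fraction of
    all participants. The combined catch-all cell may itself still be small.
    """
    threshold = min(min_count, math.ceil(min_fraction * len(values)))
    counts = Counter(values)
    return [v if counts[v] >= threshold else other_label for v in values]


# Made-up records for illustration.
ORIGINAL = [
    {"id": "1001", "randomized": date(2010, 3, 1), "mi_date": date(2010, 6, 9)},
    {"id": "1002", "randomized": date(2010, 3, 15), "mi_date": None},
    {"id": "1003", "randomized": date(2010, 4, 2), "mi_date": date(2011, 4, 2)},
]

# The link between original and new IDs is kept apart from the shared data.
link = assign_random_ids([r["id"] for r in ORIGINAL], random.SystemRandom())

# Sorting by the new ID removes the original record order as well.
redacted = sorted(
    (
        {
            "new_id": link[r["id"]],
            "days_to_mi": None
            if r["mi_date"] is None
            else days_from_reference(r["mi_date"], r["randomized"]),
        }
        for r in ORIGINAL
    ),
    key=lambda r: r["new_id"],
)

marital = ["married"] * 30 + ["single"] * 18 + ["widowed"] * 2
marital_recoded = collapse_small_cells(marital)
```

In practice the choice of which categories to merge would be a substantive decision rather than a mechanical one, and the proposed recoding would be summarized for the NHLBI Data Repository representative before being applied.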

Study Documentation

Documentation for data sets must be comprehensive and sufficiently clear to enable investigators who are not familiar with a data set to use it. The documentation must include data collection forms, study procedures and protocols, data dictionaries and algorithms for calculated data elements, descriptions of all variable recoding performed, and a list of all study publications to date.

In addition, a summary documentation file is required and must provide a complete overview of the data and a description of their use for investigators who are not familiar with the data set. It must also contain a brief description of the study (including a general orientation to the study, its components, and its examination and follow-up timeline), a listing of all files being provided, and, if applicable, generation program code for SAS-formatted data files or formats.

Selected study documentation will be used to describe the study on the Data Repository website. Examples include Forms, Data Dictionaries, Descriptive Statistics, and the Study Protocol. These documents should be accessible to those with disabilities according to Section 508 of the Rehabilitation Act. The DHHS maintains a website devoted to Section 508 issues, with links to resources on creating and checking accessibility.

Format, Storage, and Delivery of Study Data and Materials

A checklist to aid in the preparation of study data and documentation is available on the BioLINCC website. A study details worksheet and an informed consent questionnaire must be submitted with the study materials; both are also available from the BioLINCC website.

The preferred format for the data is SAS; however, data prepared using any standard statistical software (e.g., SPSS, Stata) or as an ASCII text file are acceptable. Data submitted in ASCII format must include SAS generation code to convert the file to SAS format. Regardless of data submission format, each data element must be provided with a descriptive label.

Both the comprehensive documentation and the summary documentation must be prepared in a consistent format, as WordPerfect, MS Word, ASCII, or portable document format (PDF) files, and included on the same storage medium as the data set. To ensure access by users with disabilities, all PDF files must be created in Adobe Acrobat version 5.0 or higher. Documentation that is not available in electronic form, such as data collection forms, should be scanned into a graphics file, converted to a PDF file using Adobe Acrobat version 5.0 or higher, and saved on the same medium as the data set. Pedigree data should be provided in a format readable by standard genetic analysis programs such as SAGE and SOLAR, with one individual's data per line beginning with pedigree identifier, individual's ID, father's ID, mother's ID, and individual's sex.
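
The pedigree field order described above can be illustrated with a short sketch. The whitespace delimiter, the "0" placeholder for an unknown parent, and the "M"/"F" sex codes are assumptions made for the example; the exact conventions depend on the genetic analysis program being targeted and should be checked against that program's documentation.

```python
def pedigree_line(pedigree_id, individual_id, father_id, mother_id, sex, extra=()):
    """Return one record: pedigree ID, individual ID, father ID, mother ID, sex.

    Any additional data for the individual follows the five leading fields.
    """
    missing = "0"  # assumption: placeholder used when a parent is unknown
    fields = [
        pedigree_id,
        individual_id,
        father_id or missing,
        mother_id or missing,
        sex,
        *extra,
    ]
    return " ".join(str(field) for field in fields)


# Made-up family: two founders and one child with a single extra measurement.
lines = [
    pedigree_line("F1", "101", None, None, "M"),
    pedigree_line("F1", "102", None, None, "F"),
    pedigree_line("F1", "201", "101", "102", "F", extra=("1.73",)),
]
```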

Data and study documentation materials should be sent to the NHLBI prior to the end of funding and within the timelines described in the NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies.


The following links highlight NIH policy and related guidance on sharing of research data developed with NIH funding.


NHLBI Policy for Data Sharing from Clinical Trials and Epidemiological Studies
Biological Specimen and Data Repository Information Coordinating Center - BioLINCC


For questions and/or concerns regarding the content of this page, please contact the NHLBI Data Repository representative.

Sean Coady
6701 Rockledge Dr.
Bethesda, MD 20892
(301) 435-1289