
Big Data Integration: Unlocking the Potential for Enhanced Epidemiological Research

September 27-28, 2023
Virtual Zoom Workshop
Wednesday, September 27, 2023: 10:00 a.m. – 4:30 p.m. EDT
Thursday, September 28, 2023: 9:30 a.m. – 4:00 p.m. EDT


In September 2023, the National Institutes of Health’s (NIH) National Heart, Lung, and Blood Institute (NHLBI), in collaboration with the National Cancer Institute (NCI) and the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), hosted a two-day virtual workshop titled "Big Data Integration: Unlocking the Potential for Enhanced Epidemiological Research." The workshop convened leading experts to explore the advancements and challenges in integrating large-scale data for epidemiological research, with a particular focus on emerging opportunities presented by big data and artificial intelligence (AI) technologies.


The rapid growth of data in the medical and research fields, coupled with advancements in AI tools and data processing capabilities, has significantly expanded the horizons for epidemiological research. This growth presents a unique opportunity to develop effective strategies for integrating and harnessing big data, with the aim of translating it into actionable insights for predicting risk, diagnosing and treating chronic diseases, and identifying gaps and opportunities for future research.

However, managing the volume and diversity of these data presents significant challenges. Integrating varied data sources, ensuring the quality and reliability of data, and addressing ethical and equity considerations in data usage are important issues that require further research. Additionally, the rapid pace of technological change requires continuous adaptation and innovation in research methodologies and strategies.

Convened to address these emerging challenges and opportunities, this workshop focused on leveraging big data for the advancement of epidemiological research, particularly in chronic diseases. It aimed to explore and develop strategies that could harness the vast potential of big data, leading to improved healthcare outcomes and bridging gaps in health equity.


Workshop Objectives

  • Understand the current landscape of big data in healthcare and its implications for epidemiological research.
  • Discuss the role and potential of AI in healthcare, emphasizing its capabilities and limitations.
  • Explore strategies for effective data integration while ensuring data privacy and security.
  • Highlight research opportunities presented by large-scale datasets and biobanks.
  • Foster collaboration among stakeholders to drive innovation in the field.

Main Discussion Topics

In the realm of epidemiological research, the workshop highlighted the growing importance of integrating multifaceted data from diverse sources. This integration is fundamental, not just for enriching health research but also for shaping the future of public health, policy, and clinical practice. By weaving together data from different fields, we gain a more comprehensive understanding of health and disease, which in turn informs better strategies.

A significant portion of the discussions revolved around the strategic use of existing cohort study data, electronic health records (EHRs), and other real-world data sources. This combination is seen as an opportunity to drive scientific discovery, particularly in enhancing public health outcomes with a keen focus on health equity. Collaborative data sharing initiatives, such as the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium and the Atherosclerosis Risk in Communities (ARIC) study, both of which were presented at this workshop, stand out as important components of effective public health analytics. These efforts contribute to a robust foundation for understanding health trends and outcomes, especially in common chronic and rare diseases, and are particularly vital for underrepresented minority populations.

The recent COVID-19 pandemic has been a catalyst for the adoption of digital health technologies, demonstrating the value of data from wearable sensors and smart devices. These technologies are unlocking novel insights into human health behaviors and outcomes, paving the way for new health monitoring and intervention strategies. This surge in digital health data, while promising, brings to the forefront the need to validate digital biomarkers and ensure regulatory oversight. Policy initiatives supporting the integration of digital health data into clinical practice must adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles, balancing innovation with data integrity and privacy concerns. In this context, synthetic data is emerging as a promising avenue to address privacy issues.

Parallel to these developments, AI is reshaping patient diagnostics, risk prediction, and care. Advanced models such as GPT-4 and machine vision systems are not only reducing diagnostic errors but sometimes even surpassing human capabilities. The rise of multimodal AI, which integrates diverse data types, opens a new era of personalized medicine. However, this reliance on AI introduces ethical and practical concerns, including trust, data quality, misinformation, and equity. A collaborative approach, bridging academia and industry, is essential to ensure equitable access to these transformative technologies. In addition, the accuracy and incremental gain of AI tools should be evaluated quantitatively, following established principles for validation studies.

The workshop presented a shift in perspective on health data, highlighting the importance of data quality rather than volume. Large datasets, while useful for advanced analytical methods, may contain biases that could influence research outcomes. It is important to ensure that health outcomes are accurately linked to relevant data sources and to improve data collection and algorithm design so that health assessments become more precise and equitable.

As we broaden our data collection and integration efforts, ethical considerations must remain at the forefront. This is necessary to ensure that advancements in healthcare are beneficial to all. Secure and well-designed platforms for patients to share their health data safely are integral, both for streamlining research and for delivering more effective healthcare. As health IT integrates larger datasets, privacy and reidentification risks must be mitigated. The success of interoperability hinges on emerging IT standards (ONC’s USCDI, FHIR, HL7, LOINC, and SNOMED, among others) and on support for biomedical research networks, which are key to enhancing health information exchange.

Finally, because some health determinants begin prenatally, integrating and analyzing a wide array of data types – from passive government records to actively collected digital footprints and research data – is fundamental. This approach is vital for crafting data-driven, informed healthcare policies and interventions that span a person's entire lifespan. Initiatives such as NIH’s All of Us, the Million Veteran Program, and the UK Biobank emphasize the importance of inclusive data access for comprehensive health research. Ensuring a diverse collection of large-scale health data is essential, not only for inclusive biomedical research but also for the education of early-career researchers, clinicians, and healthcare providers. Leveraging data insights for improved patient care is the overarching goal of these discussions.

Gaps & Opportunities

Data Ethics, Privacy, and Governance

  • Ethical Frameworks: Develop ethical frameworks that not only comply with current regulations but also anticipate future challenges in data usage, ensuring that ethical considerations are integrated into healthcare research from the outset.
  • Advanced De-identification: Innovate de-identification techniques that employ the latest advancements in cryptography and data science, ensuring that shared data remains a powerful tool for research without compromising individual privacy.
  • Risk Communication: Implement clear, transparent, and accessible risk communication strategies, empowering participants to make informed decisions about their involvement in research, thereby building trust and encouraging participation.

Predictive Health Analytics and Inclusivity

  • Data Collection Tools: Encourage the development and use of both passive and real-time data collection tools, like wearable technology and mobile health apps, for more comprehensive health data gathering.
  • Representative Predictive Models: Support the creation of predictive models that are built on datasets that represent diverse populations, ensuring equitable and universally applicable health predictions and interventions. 
  • Healthcare Accessibility: Utilize big data to identify and address healthcare accessibility issues, focusing on underserved and rural populations.
  • Mental Health Data Integration: Integrate mental health data into broader health research, understanding its impact on overall health outcomes.

Clinical and Translational Research

  • Biomarker Specificity: Direct resources towards identifying and validating disease-specific biomarkers, enhancing early and accurate diagnoses.
  • Translational Pathways: Strengthen translational research pathways, ensuring swift and effective translation of clinical study findings into practical healthcare applications and improved patient outcomes.

Longitudinal Studies and Sustainable Funding

  • Longitudinal Research: Support longitudinal studies, which are key to understanding disease progression and long-term treatment effects and which inform both policy and clinical practice.
  • Funding Models: Develop innovative funding models that provide stable, long-term support for health research, ensuring that valuable studies are not cut short by financial constraints.

Collaboration, Education, and Regulatory Engagement

  • Research Networks: Foster robust networks that connect researchers, practitioners, and policymakers, facilitating the exchange of knowledge and resources, and amplifying the impact of research.
  • Open Science and Data Sharing: Promote open science principles by encouraging the sharing of algorithms and code underlying key research findings and clinical practices. Develop secure and privacy-preserving methods for sharing data used in these analyses, enhancing transparency and reproducibility in research.
  • Educational Initiatives: Invest in educational initiatives that prepare the next generation of researchers to effectively harness the power of digital health data.

Global Health Impact and Participant Engagement

  • Environmental Health Research: Expand the scope of health research to include the study of environmental exposures, recognizing their significant impact on global health and informing international health policies.
  • Engagement Strategies: Develop innovative participant engagement strategies to benefit from diverse experiences and insights, leading to more effective health interventions.

Workshop Co-Chairs:

Josef Coresh, MD, PhD - Professor, Johns Hopkins University
Jessilyn Dunn, PhD - Assistant Professor, Biomedical Engineering and Biostatistics & Bioinformatics, Duke University

Workshop Planning Committee:

NHLBI: Gabriel Anaya, MD, MSc; Yuling Hong, MD, MSc, PhD; Erin Iturriaga, DNP, MSN, RN; Asif Rizwan, PhD; Eric Shiroma, ScD, MEd, FACSM; Melissa Garcia, MPH; and Alfonso Alfini, PhD, MS

NCI: Dana Wolff-Hughes, PhD; and Marissa Shams-White, PhD, MSTOM, MS, MPH

OD: Sheri Schully, PhD; and Rebecca Krupenevich, AAAS Fellow

NCATS: Josh Fessel, MD, PhD

NIDDK: Ivonne Schulman, MD; Jean M. Lawrence, ScD, MPH, MSSA; Kevin Abbott, MD, MPH; Neha Shah, MSPH; Aynur Unalp-Arida, MD, MSc, PhD; and Kenneth J. Wilkins, PhD