NHLBI Workshop: The Promise of NHLBI Data Science

July 20-21, 2021
Virtual Workshop

Description

The National Institutes of Health’s (NIH) National Heart, Lung, and Blood Institute (NHLBI) hosted this workshop for researchers interested in collaborative, innovative data science and data-focused methods in heart, lung, blood, and sleep (HLBS) research domains. The objectives of this workshop were to provide opportunities for:

  • Biomedical researchers to learn the state-of-the-art data science methods, such as data and computing infrastructure, machine learning (ML), and artificial intelligence (AI) procedures.
  • Data scientists to understand the important questions in HLBS-related research.
  • All participants to learn about collaborative data science methods and HLBS discoveries through applications to NHLBI studies.
  • Developing research and education programs for training a new generation of data scientists with diverse backgrounds.

Data science is a high priority for NHLBI, which plans to be at the vanguard of research in this field. Nurturing the next generation of scientific leaders who will benefit from training in and experience with data science in HLBS research is an NHLBI priority. Another Institute priority is to be at the leading edge in data resources, platforms, tools, and training to ensure the HLBS research community has the capabilities needed to advance NHLBI’s agenda.

The NHLBI Trans-Omics for Precision Medicine (TOPMed) program is a central component of NHLBI’s data science activities and an example of its commitment to diversity and inclusion in data science. Researchers can gain access to TOPMed data through NHLBI’s BioData Catalyst, a data science ecosystem providing NHLBI data and computational resources to investigators around the world.

NHLBI Data Science Overview

Two objectives of the NHLBI Strategic Vision are particularly relevant to this workshop:

  • Leverage emerging opportunities in data science to open new frontiers in HLBS research.
  • Identify factors that account for individual differences in pathobiology and in responses to treatments.

NHLBI’s principles and values—such as inclusive excellence, a health equity framework, and multidisciplinary team science—guide its approach to harnessing data science for driving precision medicine.

Trustworthy Artificial Intelligence Systems and Role of Datasets and Data Scientists

In health care, AI applications are used for a broad range of purposes, including telemedicine, research, medical records, and medical imaging. AI applies experience and data to mimic human cognitive capabilities and calculates the probability that a hypothesis is true. Robust and trustworthy AI is important because ML algorithms and AI systems can be biased and misleading as a result of flawed learning models, input data, or both. Trustworthy AI is explainable; for example, it uses an explainable model and produces a rationale for its result. Other issues that need to be addressed to develop trustworthy AI models include compliance with statutory, regulatory, and ethical requirements as well as ways to address privacy and security risks.

Presentations on Datasets

BioData Catalyst is an ecosystem of integrated cloud-based platforms and services. The components include data-search and cohort-formation capabilities, reusable workflows, cloud-based secure workspaces, and imaging and AI tools and applications. Users can bring their own data and applications to projects within BioData Catalyst so they may keep elements of each project in one secure workspace. The platform also allows researchers to work with data stored in different cloud locations (including Google Cloud and Amazon Web Services) using a single interface. In addition to TOPMed, BioData Catalyst currently hosts a subset of BioLINCC datasets from sickle cell disease (SCD) studies and open-access data from the 1000 Genomes Project, as well as COVID-19 data from the Prevention & Early Treatment of Acute Lung Injury (PETAL) Network. New NHLBI data assets are continuously being added to the ecosystem.

Introduction to GWAS Resources in BioData Catalyst

BioData Catalyst provides access to over 100,000 Whole Genome Sequencing (WGS) records offering statistical power for discovery. In addition to the data hosted by BioData Catalyst, users can leverage data hosted by other NIH cloud ecosystems if they have access to those datasets. They can also upload data for which they have appropriate approval as long as doing so does not violate the terms of their data use agreements, limitations, or institutional review board policies. BioData Catalyst supports collaboration in secure cloud workspaces. Users can control who has access to each workspace, share their hosted data with other workspaces and collaborators, initiate interactive analysis sessions, and launch scalable workflows.

Getting Started on BioData Catalyst

The BioData Catalyst ecosystem is nimble and responsive to the ever-changing conditions of biomedical and data science communities. BioData Catalyst offers elastic computing in that users pay only for what they use, and it facilitates access to many high-value datasets and tools for data discovery. Users do not need to download and manage multiple large datasets or manage computer systems, and a help desk and wealth of support documents and tools are available.

New users can register and log in to BioData Catalyst and view the platform’s open-access data (e.g., 1000 Genomes Project) using their NIH eRA Commons identification information and then register for each BioData Catalyst platform separately. Those seeking to view and use controlled-access data need dbGaP approval for access to those data. BioData Catalyst provides information on how to submit a request to dbGaP for access to controlled-access data.

Users do not pay for the storage of hosted datasets, but they do pay for computation and storage of derived results. BioData Catalyst offers resources to help users include data costs in grant applications. The services page provides information on connecting to and learning about the platforms and services available in the BioData Catalyst ecosystem. BioData Catalyst is building a community of practice to collaboratively solve technical and scientific challenges. Peer-to-peer mentoring and a community forum are available, and office hours will soon be launched. In addition, webinars will be offered, and a Slack channel will be created for informal discussions. Users who join the ecosystem will receive an introductory email with links to resources, including the BioData Catalyst Getting Started Guide. New users begin by registering for the BioData Catalyst Community, which does not require an eRA Commons account. The community provides updates on new releases, upcoming events, and a range of opportunities.

Data Challenges Across Multiple Datasets and Novel Computational Methods

The computing and engineering community creates new data science algorithms but does not necessarily have expertise in the meaning of the data that it works with. Disciplinary barriers are high, and biomedical and data science departments are often on different campuses. Partnerships with the data science community require an understanding of the differences between this community and the biomedical science community as well as how to make collaborations successful.

Examples were provided to demonstrate the value of AI and ML for biomedical research:

  • A data-driven stratification and prognosis system for traumatic brain injury
  • A system that uses wearable sensors, smartphones, and other technology in the home to monitor family eating dynamics and support development of interventions to reduce obesity

Key challenges for applying AI in health care were identified:

  • Research using very small datasets that have biases and other weaknesses
  • Differences in approaches between computer scientists and biomedical researchers in how to collect, store, and analyze data
  • Privacy issues raised by the large amounts of health care data needed for some computer science and engineering applications

Biomedical researchers sometimes perceive AI and ML as black boxes. Educating biomedical researchers about data science should not be the sole responsibility of data scientists; researchers also need to be proactive about educating themselves. For example, data scientists can create a list of features in an AI system for electronic health records (EHRs) to help physicians understand that the system provides the capabilities that they are seeking.

Before systems that use EHR data are deployed in clinical practice, two safeguards are needed: identifying the biases in health and biomedical data and developing algorithms to address these biases. EHR data have real value, but the question is how to learn fundamental principles from high-quality data so that AI systems can be deployed with EHR data that are lower quality and less consistent. Validation is always critical to ensure that an algorithm works as intended.

Toward Machine Learning for Personalized Health Care

ML has the potential to have a major impact on health care, but a challenge is identifying the right problems to address with ML and determining how to deploy ML so that it delivers the desired benefits. AI is useful for identifying patterns (e.g., in radiology images or genetic data), and investigators are developing methods to use AI to blend data from different sources and to offer real-time and understandable predictions. Challenges for AI in health care include hidden variables (such as social determinants of health) and how to code evolving diagnoses.

Gaps in ML include the need for better models, evaluation metrics, and ways to protect privacy (e.g., by using synthetic and aggregated data) and ensure trustworthiness (e.g., by making ML systems explainable, interpretable, fair, and robust). Different groups can have biases about health care access and quality of care, and these biases can be addressed in different ways.

ML systems can use fairness tools, which are designed to understand and mitigate differences in metrics across groups. Alternatively, data can be preprocessed to add fairness properties, and ML systems can be trained to adjust their predictions to ensure fairness. Most methods to address fairness in ML models are designed to ensure that these models behave similarly in different groups (e.g., genders and racial/ethnic groups). However, defining these groups in different ways could lead models to behave differently.
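
One way to picture "differences in metrics across groups" is to compute the same metric separately for each group and compare. The following is a minimal sketch, not a description of any specific fairness tool: the function names, the choice of true positive rate as the metric, and the toy labels and groups are all assumptions made for illustration.

```python
from collections import defaultdict


def true_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: TPR}, where TPR = correctly flagged positives / all positives."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                hits[group] += 1
    return {g: hits[g] / positives[g] for g in positives}


def max_gap(rates):
    """Largest difference in the metric between any two groups."""
    values = list(rates.values())
    return max(values) - min(values)


# Hypothetical labels (1 = has the condition) and model predictions.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = true_positive_rate_by_group(y_true, y_pred, groups)
gap = max_gap(rates)
print(rates, gap)
```

In this toy data the model detects 3 of 4 true cases in group A but only 2 of 4 in group B. Re-running the same function with a different `groups` list shows how the measured gap depends on how the groups are defined.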

Many initiatives are trying to improve health care system data because data quality issues are often the source of challenges in ML systems. The fastest way to solve many fairness questions is to collect better data that have fewer disparity issues from the start.

Collecting data in a single place raises challenges with privacy and intellectual property. Federated learning is a set of tools that can be used to build and deploy ML models without collecting data in a single place, but these systems have challenges with data harmonization.
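
The idea of training without pooling data can be sketched with federated averaging, one common federated learning scheme (the workshop summary does not say which scheme was discussed). In this hedged example, each hypothetical site fits a one-variable linear model on its own data and shares only the fitted weights; the site data, learning rate, and function names are illustrative assumptions.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Run a few gradient-descent steps for y ~ w*x + b using one site's data only."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(site_weights, site_sizes):
    """Combine site models, weighting each by its number of records."""
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return (w, b)


def train(sites, rounds=50):
    """The coordinator sees only model weights, never the raw records."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates, [len(d) for d in sites])
    return global_weights


# Hypothetical (x, y) records held at two sites, both generated from y = 2x + 1.
site_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
site_b = [(3.0, 7.0), (4.0, 9.0)]

w, b = train([site_a, site_b])
print(round(w, 3), round(b, 3))
```

The sketch assumes both sites already record `x` and `y` in the same units and format; in practice that harmonization step is where much of the difficulty lies.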

Knowledge-Guided Machine Learning in Disease Prediction with Big and Complex Data

Knowledge-guided ML is an important tool for constructing statistical predictive models of HLBS disease outcomes with large and complex datasets. Traditional statistical regression and modeling methods are often inappropriate when there are a large number of potential disease predictors, such as complex biomarker and imaging observations. ML tools, including random forests, regularized regression, and neural networks, could be used to select biomarkers and imaging parameters for more accurate prediction of disease outcomes. For example, by combining ML tools with regression methods, accurate statistical models could be established to predict mortality and disease progression in patients with SCD and other HLBS disorders.
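
As an illustration of how regularized regression can select predictors, the sketch below implements lasso regression by coordinate descent. It is a toy under stated assumptions, not the presenters' method: the three "biomarker" columns and the outcome are made-up numbers, the predictors are assumed to be centered so no intercept is fitted, and the penalty `alpha` is chosen by hand.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of predictor j with the partial residual
            z = 0.0    # squared norm of predictor j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Hypothetical centered predictors: the outcome depends strongly on column 0,
# very weakly on column 1, and not at all on column 2.
X = [
    [1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0],
    [1.0, -1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
y = [3.1, -2.9, 2.9, -3.1]

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
print(w, selected)
```

The penalty sets the coefficients of the weak and irrelevant predictors to exactly zero, which is what makes this family of methods usable as a selection step before fitting a conventional regression model.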

Machine Learning Tools for Synergistically Mining Complex Data and Prior Knowledge

A physics-informed neural network (PINN) is a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. A PINN uses at least two neural networks that share parameters and can be used in biomedical and physical applications. PINNs have been used to predict ruptures of brain aneurysms. PINNs can also be used to predict hypoglycemia, which can be fatal within 30 minutes in people with diabetes. In the database that NHLBI PIs used to effectively train the PINN, less than 10% of available data points corresponded with hypoglycemia. Mixup is another method that linearly combines two samples to create a new sample.
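
The linear combination that Mixup performs can be sketched in a few lines. This follows the usual formulation, in which the mixing weight is drawn from a Beta(alpha, alpha) distribution; the feature vectors, one-hot labels, and the optional `lam` argument (added here only to make the example reproducible) are illustrative assumptions.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, lam=None, rng=random):
    """Blend two (features, one-hot label) samples into one synthetic sample."""
    if lam is None:
        lam = rng.betavariate(alpha, alpha)  # mixing weight between 0 and 1
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam


# Hypothetical samples: (features, one-hot label) for a rare and a common class.
rare_event = ([3.1, 0.4], [1.0, 0.0])
common_event = ([6.0, 0.1], [0.0, 1.0])

new_x, new_y, lam = mixup(*rare_event, *common_event, lam=0.25)
print(new_x, new_y)
```

Because the labels are blended with the same weight as the features, the synthetic sample carries a soft label (here 25% rare class, 75% common class) rather than a hard one.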

OpenRBC was described as software capable of simulating the red blood cell lipid bilayer and cytoskeleton as modeled by millions of mesoscopic particles. For example, it can identify sickle-shaped cells in people with SCD and monitor how blood cell clusters develop and circulate. It can also measure blood viscosity in silico using multiscale modeling. Thus, it has potential use in testing new therapies for SCD.

From Transcript to Tissue: Synthesis and Interpretation of Data Across Multiple Scales

There are opportunities to combine gene expression studies with physics-based models of disease. Gene ontology studies can provide a transcriptional profile of disease and reduce complex, poorly understood mechanisms to a small set of biological processes. Research on Marfan syndrome shows how physics-based models can also shed light on disease processes. A physics-based model of Marfan syndrome shows the applied loads, changes in pressure and flow, and how these applied loads would affect the arterial wall at a tissue level (the level that is observed clinically with imaging) in people with Marfan syndrome who have hypertension. Physics-based models can be used for the fluid mechanics and for the wall mechanics, which can show the geometry and microstructural composition. Confidence in these physics-based models is high, but they need to be connected to the available transcriptional and even proteomic information. Thus, combining transcriptomics, proteomics, cell signaling, quantitative imaging, and computational methods can provide unprecedented insight into diseases like Marfan syndrome.

Interpreting Results from Genome-wide Searches—Experiences from TOPMed

The Trans-Omics for Precision Medicine (TOPMed) program, sponsored by the National Institutes of Health (NIH) National Heart, Lung, and Blood Institute (NHLBI), aims to provide disease treatments tailored to an individual’s unique genes and environment. For example, TOPMed data have been used to search for gene variants associated with red blood cell quantitative traits. In an analysis of WGS data from 62,653 people, the investigators found 14 single variants at 12 genomic loci that were associated with seven red blood cell quantitative traits. Almost all WGS P-values use large-sample approximations. In many cases, approximations are good enough, but this is not the case in WGS, where signals are very small and the accuracy of approximations matters. Therefore, aggressive filtration is important. Saddle point approximations can eliminate spurious signals by improving approximations.
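
To see why the accuracy of an approximation matters in the far tail, the sketch below compares a normal (large-sample) tail probability with a saddlepoint (Lugannani-Rice) approximation on a case where the exact answer is known. This is a stand-in chosen for simplicity, not a genetic test statistic: the statistic is assumed to be a sum of `n` independent Exponential(1) variables, and the function names and the values `n = 5`, `s = 15` are illustrative.

```python
import math


def normal_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def exact_tail(n, s):
    """Exact P(S > s) for S = sum of n independent Exponential(1) variables."""
    return math.exp(-s) * sum(s ** k / math.factorial(k) for k in range(n))


def clt_tail(n, s):
    """Large-sample approximation: treat S as normal with mean n and variance n."""
    return normal_tail((s - n) / math.sqrt(n))


def saddlepoint_tail(n, s):
    """Lugannani-Rice approximation to P(S > s); requires s != n."""
    t_hat = 1.0 - n / s                  # solves K'(t) = n / (1 - t) = s
    k_hat = -n * math.log(1.0 - t_hat)   # cumulant generating function K(t)
    k2_hat = n / (1.0 - t_hat) ** 2      # second derivative K''(t)
    w = math.copysign(math.sqrt(2.0 * (t_hat * s - k_hat)), t_hat)
    u = t_hat * math.sqrt(k2_hat)
    return normal_tail(w) + normal_pdf(w) * (1.0 / u - 1.0 / w)


n, s = 5, 15.0
p_exact = exact_tail(n, s)
p_clt = clt_tail(n, s)
p_spa = saddlepoint_tail(n, s)
print(p_exact, p_clt, p_spa)
```

In this toy case the normal approximation understates the tail probability by roughly two orders of magnitude, which is the kind of error that would make an unremarkable result look like a signal, while the saddlepoint value stays within about 1% of the exact answer.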

Importance of Diversity in STEM

National Science Board member Dr. Victor McCrary talked about the power of tapping into America's diversity to help address many of the challenges in STEM training. The Board serves as an independent body of advisors to both the President and Congress on policy matters related to science and engineering and education in science and engineering. Vision 2030 lays out what the National Science Board believes the United States must do to remain the world innovation leader in 2030. The Vision Roadmap lays out actions needed to achieve the vision and has four goals:

  • Foster a global science and engineering community.
  • Be competitive and attractive domestically.
  • Deliver benefits from research.
  • Develop STEM talent for America.

NSF’s 2021 report on Women, Minorities, and People with Disabilities in Science and Engineering shows that scientists and engineers with one or more disabilities have an unemployment rate greater than that of the U.S. labor force. In addition, Black science and engineering doctoral recipients report higher rates of paying for their graduate education through personal funds, including loans, than other racial groups, which sets up inequitable long-term financial burdens.

Conclusion

NHLBI hopes to broaden the pool of investigators with access to its data science resources; increase the diversity of these investigators; and bring people in data science, social science, and biological systems together to form teams that can address research questions related to HLBS priorities. Data science has the potential to enable the NHLBI research community to answer new questions, in ways that were previously not possible, to create meaningful impact for the health of the Nation.

Agenda

10:00 AM – 1:00 PM
SESSION 1 CHAIR: JONATHAN KALTMAN, M.D.

10:00 – 10:10 AM
Welcome and Opening Remarks
Gary Gibbons, M.D.
Director, National Heart, Lung, and Blood Institute

10:10 – 10:20 AM
NHLBI Data Science Overview
David Goff, M.D., Ph.D.
Director, Division of Cardiovascular Sciences
National Heart, Lung, and Blood Institute

10:20 – 10:45 AM 
Keynote Address: Trustworthy AI Systems and Role of Datasets & Data Scientists
Danda Rawat, Ph.D.
Director, Data Science and Cybersecurity Center (DSC2)
Professor, Electrical Engineering and Computer Science
Howard University

10:45 AM - 12:00 PM 
Presentations on Datasets:
Getting on NHLBI BioData Catalyst Powered by Seven Bridges
Dave Roberson, B.S.
Community Engagement Manager, Biomedical Research Platforms
Seven Bridges

Community Engagement for Biomedical Research Platforms Seven Bridges
Alison Leaf, Ph.D.
Senior Program Manager
Seven Bridges

12:00 - 1:00 PM
Presentation: NHLBI-Generated Clinical and Genomic Big Data: Identify available genomic and clinical National Heart, Lung, Blood & Sleep Institute datasets and submit data access requests for analysis in the cloud.
Sweta Ladwa, M.P.H., P.M.P.
Senior Scientific Program Manager, Information Technology and Application Center (ITAC)
National Heart, Lung, and Blood Institute 

1:00 – 1:30 PM
Lunch Break

1:30 – 5:30 PM
SESSION 2 CHAIR: ERIN ITURRIAGA, D.N.P., M.S.N., R.N.

1:30 – 3:00 PM
Introduction to Genome-Wide Association Studies (GWAS) resources in BioData Catalyst
Beth Sheets, M.S.
Program Manager
UC Santa Cruz Genomics Institute

Fayuan Wen, Ph.D.
Postdoctoral Associate
Howard University

3:00 – 4:00 PM
Open Discussion Time

4:00 – 4:45 PM
Getting Started on BioData Catalyst
Amber Voght
User Engagement Specialist
Renaissance Computing Institute at UNC (RENCI)

4:45 - 5:30 PM
Wrap-up Day 1: Data Challenges Across Multiple Datasets and Novel Computational Methods
Wendy Nilsen, Ph.D.
Program Director
Smart and Connected Health
Directorate for Computer & Information Science & Engineering
National Science Foundation

 

10:00 AM – 1:00 PM
SESSION 3 CHAIR: ASIF RIZWAN, Ph.D.

10:00 - 10:30 AM
Plenary Address: Towards Machine Learning for Personalized Healthcare
Sanmi Koyejo, Ph.D.
Assistant Professor, Department of Computer Science
University of Illinois at Urbana-Champaign

10:30 – 11:15 AM
Presentation: Application of Machine Learning and Artificial Intelligence Methods in Visualizing and Modelling of Complex Imaging and Clinical Data
Xin Tian, Ph.D.
Mathematical Statistician, Division of Intramural Research
National Heart, Lung, and Blood Institute

Li-Yueh Hsu, D.Sc.
Staff Scientist, Radiology and Imaging Sciences
Clinical Center, National Institutes of Health

11:15 AM – 12:00 PM
Presentation: Machine Learning Tools for Synergistically Mining Complex Data and Prior Knowledge
George Em Karniadakis, Ph.D.
Professor of Applied Mathematics, Center for Fluid Mechanics
Brown University

12:00 – 12:45 PM
Presentation: From Transcript to Tissue: Synthesis and Interpretation of Data Across Scales
Jay Humphrey, Ph.D.
John C. Malone Professor of Biomedical Engineering
Department Chair, Biomedical Engineering
Yale University

Presentation: Interpreting Results from Genome-wide Searches – Experiences from TOPMed
Ken Rice, Ph.D.
Professor, Department of Biostatistics
University of Washington

12:45 – 1:30 PM
Lunch Break

1:30 – 4:45 PM
SESSION 4 CHAIR: COLIN WU, Ph.D.

1:30 – 3:00 PM
Case Study Presentations (BioData Catalyst Fellows)
Case Study 1 – Blood (genetic risk of allergic disease)
Michelle Daya, Ph.D.
University of Colorado Denver
Project: HLA and Genome-Wide Association Studies of Total Serum IgE Levels

Case Study 2 – Heart (atrial fibrillation)
Seung Hoan Choi, Ph.D.
Broad Institute of MIT and Harvard
Project: Genetic Architecture and Contribution of Rare Mutations to Atrial Fibrillation Risk

Case Study 3 – COPD (imaging phenotypes)
Dandi Qiao, Ph.D.
Brigham and Women’s Hospital
Project: Whole Genome-Sequencing Analyses of Imaging Phenotypes of Chronic Obstructive Pulmonary Disease (COPD)

Case Study 4 – Sickle Cell Disease (iron overload)
Fayuan Wen, Ph.D.
Howard University
Project: Association Study of Iron Overload in Sickle Cell Disease Population Using NHLBI WGS from TOPMed

3:00 – 3:30 PM
Plenary Address: National Science Board Vision 2030 - The Importance of Diversity in STEM
Victor McCrary, Jr., Ph.D.
Vice President, Research and Graduate Programs, University of the District of Columbia
Chair, National Science Board

3:30 – 4:30 PM
Panel Discussion: Future Needs and Directions of Data Science in HLBS Research
Jonathan Kaltman, M.D.
Senior Scientific Advisor/Lead in Data Science
National Heart, Lung, and Blood Institute

Asif Rizwan, Ph.D.
Program Officer
Division of Blood Diseases and Resources
National Heart, Lung, and Blood Institute

Colin Wu, Ph.D.
Program Officer/Math Statistician
Office of Biostatistics Research
National Heart, Lung, and Blood Institute

4:30 – 4:45 PM
Wrap Up for Organizers
Erin Iturriaga, D.N.P., M.S.N., R.N.
Program Officer/Clinical Trials Specialist
Atherothrombosis and Coronary Artery Disease Branch
Division of Cardiovascular Sciences
National Heart, Lung, and Blood Institute