
Description
The National Institutes of Health’s (NIH) National Heart, Lung, and Blood Institute (NHLBI) hosted this workshop for researchers interested in collaborative, innovative data science and data-focused methods in heart, lung, blood, and sleep (HLBS) research domains. The objectives of this workshop were to provide opportunities for:
- Biomedical researchers to learn state-of-the-art data science methods, such as data and computing infrastructure, machine learning (ML), and artificial intelligence (AI).
- Data scientists to understand the important questions in HLBS-related research.
- All participants to learn about collaborative data science methods and HLBS discoveries through applications to NHLBI studies.
- The development of research and education programs for training a new generation of data scientists with diverse backgrounds.
Data science is a high priority for NHLBI, which plans to be at the vanguard of research in this field. Nurturing the next generation of scientific leaders who will benefit from training in and experience with data science in HLBS research is an NHLBI priority. Another Institute priority is to be at the leading edge in data resources, platforms, tools, and training to ensure the HLBS research community has the capabilities needed to advance NHLBI’s agenda.
The NHLBI Trans-Omics for Precision Medicine (TOPMed) program is a central component of NHLBI’s data science activities and an example of its commitment to diversity and inclusion in data science. Researchers can gain access to TOPMed data through NHLBI’s BioData Catalyst, a data science ecosystem providing NHLBI data and computational resources to investigators around the world.
NHLBI Data Science Overview
Two objectives of the NHLBI Strategic Vision are particularly relevant to this workshop:
- Leverage emerging opportunities in data science to open new frontiers in HLBS research.
- Identify factors that account for individual differences in pathobiology and in responses to treatments.
NHLBI’s principles and values—such as inclusive excellence, a health equity framework, and multidisciplinary team science—guide its approach to harnessing data science for driving precision medicine.
Trustworthy Artificial Intelligence Systems and Role of Datasets and Data Scientists
In health care, AI applications are used for a broad range of purposes, including telemedicine, research, medical records, and medical imaging. AI applies experience and data to mimic human cognitive capabilities and calculates the probability that a hypothesis is true. Robust and trustworthy AI is important because ML algorithms and AI systems can be biased and misleading as a result of flawed learning models, input data, or both. Trustworthy AI is explainable; for example, it uses an explainable model and produces a rationale for its result. Other issues that need to be addressed to develop trustworthy AI models include compliance with statutory, regulatory, and ethical requirements as well as ways to address privacy and security risks.
Presentations on Datasets
BioData Catalyst is an ecosystem of integrated cloud-based platforms and services. The components include data-search and cohort-formation capabilities, reusable workflows, cloud-based secure workspaces, and imaging and AI tools and applications. Users can bring their own data and applications to projects within BioData Catalyst so they may keep elements of each project in one secure workspace. The platform also allows researchers to work with data stored in different cloud locations (including Google Cloud and Amazon Web Services) using a single interface. In addition to TOPMed, BioData Catalyst currently hosts a subset of BioLINCC datasets from sickle cell disease (SCD) studies and open-access data from the 1000 Genomes Project, as well as COVID-19 data from the Prevention & Early Treatment of Acute Lung Injury (PETAL) Network. New NHLBI data assets are continuously being added to the ecosystem.
Introduction to GWAS Resources in BioData Catalyst
BioData Catalyst provides access to over 100,000 whole genome sequencing (WGS) records, offering statistical power for discovery. In addition to the data hosted by BioData Catalyst, users can leverage data hosted by other NIH cloud ecosystems if they have access to those datasets. They can also upload data for which they have appropriate approval, as long as doing so does not violate the terms of their data use agreements, limitations, or institutional review board policies. BioData Catalyst supports collaboration in secure cloud workspaces. Users can control who has access to each workspace, share their hosted data with other workspaces and collaborators, initiate interactive analysis sessions, and launch scalable workflows.
Getting Started on BioData Catalyst
The BioData Catalyst ecosystem is nimble and responsive to the ever-changing conditions of biomedical and data science communities. BioData Catalyst offers elastic computing in that users pay only for what they use, and it facilitates access to many high-value datasets and tools for data discovery. Users do not need to download and manage multiple large datasets or manage computer systems, and a help desk and wealth of support documents and tools are available.
New users can register and log in to BioData Catalyst and view the platform’s open-access data (e.g., 1000 Genomes Project) using their NIH eRA Commons identification information and then register for each BioData Catalyst platform separately. Those seeking to view and use controlled-access data need dbGaP approval for access to those data. BioData Catalyst provides information on how to submit a request to dbGaP for access to controlled-access data.
Users do not pay for the storage of hosted datasets, but they do pay for computation and storage of derived results. BioData Catalyst offers resources to help users include data costs in grant applications. The services page provides information on connecting to and learning about the platforms and services available in the BioData Catalyst ecosystem. BioData Catalyst is building a community of practice to collaboratively solve technical and scientific challenges. Peer-to-peer mentoring and a community forum are available, and office hours will soon be launched. In addition, webinars will be offered, and a Slack channel will be created for informal discussions. Users who join the ecosystem will receive an introductory email with links to resources, including the BioData Catalyst Getting Started Guide. New users begin by registering for the BioData Catalyst Community, which does not require an eRA Commons account. The community provides updates on new releases, upcoming events, and a range of opportunities.
Data Challenges Across Multiple Datasets and Novel Computational Methods
The computing and engineering community creates new data science algorithms but does not necessarily have expertise in the meaning of the data that it works with. Disciplinary barriers are high, and biomedical and data science departments are often on different campuses. Partnerships with the data science community require an understanding of the differences between this community and the biomedical science community as well as how to make collaborations successful.
Examples were provided to demonstrate the value of AI and ML for biomedical research:
- A data-driven stratification and prognosis system for traumatic brain injury
- A system that uses wearable sensors, smartphones, and other technology in the home to monitor family eating dynamics and support development of interventions to reduce obesity
Key challenges for applying AI in health care were identified:
- Research using very small datasets that have biases and other weaknesses
- Differences in approaches between computer scientists and biomedical researchers in how to collect, store, and analyze data
- Privacy issues raised by the large amounts of health care data needed for some computer science and engineering applications
Biomedical researchers sometimes perceive AI and ML as black boxes. The onus of educating biomedical researchers about data science should not be the sole responsibility of data scientists; researchers also need to be proactive about educating themselves. For example, data scientists can create a list of features in an AI system for electronic health records (EHRs) to help physicians understand that the system provides the capabilities that they are seeking.
Before systems that use EHR data are deployed in clinical practice, safeguards are needed: biases in health and biomedical data must be identified, and algorithms developed to address them. EHR data have real value, but the question is how to learn fundamental principles from high-quality data so that AI systems can then be deployed with EHR data that are lower quality and less consistent. Validation is always critical to ensure that an algorithm works as intended.
Toward Machine Learning for Personalized Health Care
ML has the potential to make a major impact in health care, but the challenges are to identify the right problems to address with ML and to deploy it so that it delivers the desired benefits. AI is useful for identifying patterns (e.g., in radiology images or genetic data), and investigators are developing methods to use AI to blend data from different sources and to offer real-time and understandable predictions. Challenges for AI in health care include hidden variables (such as social determinants of health) and how to code evolving diagnoses.
Gaps in ML include the need for better models, evaluation metrics, and ways to protect privacy (e.g., by using synthetic and aggregated data) and ensure trustworthiness (e.g., by making ML systems explainable, interpretable, fair, and robust). Different groups can have biases about health care access and quality of care, and these biases can be addressed in different ways.
ML systems can use fairness tools, which are designed to understand and mitigate differences in metrics across groups. Alternatively, data can be preprocessed to add fairness properties, and ML systems can be trained to adjust their predictions to ensure fairness. Most methods to address fairness in ML models are designed to ensure that these models behave similarly in different groups (e.g., genders and racial/ethnic groups). However, defining these groups in different ways could lead models to behave differently.
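The group-wise comparison described above can be sketched in code. This is a generic illustration, not a tool from the workshop: the function name, the example predictions, and the groups are all hypothetical, and the metric shown (the gap in positive-prediction rates between two groups) is just one of several fairness metrics in common use.

```python
# Illustrative sketch only: compare a model's positive-prediction rate
# across two groups. A gap near zero suggests similar behavior on this
# metric; it does not by itself establish that a model is fair.

def positive_rate_gap(preds, groups, group_a, group_b):
    """Difference in mean positive-prediction rate between group_a and group_b."""
    def rate(g):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(members) / len(members)
    return rate(group_a) - rate(group_b)

preds = [1, 0, 1, 1, 0, 0]          # hypothetical binary predictions
groups = ["a", "a", "a", "b", "b", "b"]  # hypothetical group labels
gap = positive_rate_gap(preds, groups, "a", "b")  # 2/3 - 1/3
```

As the text notes, how the groups are defined changes the result: recomputing the same metric over differently drawn groups can make the same model look more or less consistent.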
Many initiatives are trying to improve health care system data because data quality issues are often the source of challenges in ML systems. The fastest way to solve many fairness questions is to collect better data that have fewer disparity issues from the start.
Collecting data from a single source raises challenges with privacy and intellectual property. Federated learning is a set of tools that can be used to build and deploy ML models without collecting data in a single place, but these systems have challenges with data harmonization.
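The core idea behind federated learning can be sketched as follows. This is a minimal, hypothetical illustration of the aggregation step only (a FedAvg-style weighted average of locally trained model weights); real federated systems also handle secure communication, client scheduling, and the data-harmonization challenges the text mentions.

```python
# Hedged sketch: each site trains a model locally and shares only its
# weight vector and sample count; the server averages the weights,
# weighted by sample count. Raw data never leave the sites.

def federated_average(site_weights, site_counts):
    """Sample-count-weighted average of per-site weight vectors."""
    total = sum(site_counts)
    dim = len(site_weights[0])
    return [sum(w[j] * n for w, n in zip(site_weights, site_counts)) / total
            for j in range(dim)]

# Two hypothetical sites with 100 and 300 samples:
avg = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
# avg is [2.5, 3.5]: the larger site pulls the average toward its weights
```

The weighting by sample count is the design choice that makes the pooled model behave as if it had been trained on the combined data, under the assumption that the sites' data are comparably distributed, which is exactly where harmonization problems arise.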
Knowledge-Guided Machine Learning in Disease Prediction with Big and Complex Data
Knowledge-guided ML is an important tool for constructing statistical predictive models of HLBS disease outcomes with large and complex datasets. Traditional statistical regression and modeling methods are often inappropriate when there are a large number of potential disease predictors, such as complex biomarker and imaging observations. ML tools, including random forests, regularized regression, and neural networks, could be used to select biomarkers and imaging parameters for more accurate prediction of disease outcomes. For example, by combining ML tools with regression methods, accurate statistical models could be established to predict mortality and disease progression in patients with SCD and other HLBS disorders.
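One of the tools named above, regularized regression, can be sketched briefly. This is a generic illustration on synthetic data, not an analysis from the workshop: ridge regression adds a penalty that shrinks coefficients, which stabilizes estimates when predictors (e.g., many correlated biomarkers) are numerous relative to the sample size.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X'X + lam*I)^(-1) X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic example: 50 subjects, 5 hypothetical biomarkers, of which
# only three truly influence the (simulated) outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_hat = ridge_fit(X, y, lam=1.0)
# Coefficients for the inactive biomarkers stay near zero, suggesting
# which predictors to carry forward into a final regression model.
```

In practice a lasso (L1) penalty is often preferred for the selection step itself, since it drives irrelevant coefficients exactly to zero; the ridge form is shown here only because it has a simple closed-form solution.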
Machine Learning Tools for Synergistically Mining Complex Data and Prior Knowledge
A physics-informed neural network (PINN) is a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. A PINN uses at least two neural networks that share parameters and can be used in biomedical and physical applications. PINNs have been used to predict ruptures of brain aneurysms. PINNs can also be used to predict hypoglycemia, which can be fatal within 30 minutes in people with diabetes. In the database used by NHLBI investigators, the PINN was trained effectively even though fewer than 10% of the available data points corresponded to hypoglycemia. Mixup is another method; it linearly combines two samples to create a new sample.
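The Mixup idea mentioned above can be shown in a few lines. This is a generic sketch of the published technique, not code from the talk; the function name and the mixing weight are illustrative.

```python
# Mixup augmentation sketch: a new training sample is a convex
# combination of two existing samples and their labels, with a mixing
# weight lam in [0, 1] (in practice lam is drawn from a Beta distribution).

def mixup(x1, y1, x2, y2, lam):
    """Linearly combine two (features, label) pairs with weight lam."""
    x_new = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_new = lam * y1 + (1 - lam) * y2
    return x_new, y_new

x_new, y_new = mixup([1.0, 0.0], 1.0, [0.0, 1.0], 0.0, lam=0.7)
# x_new is [0.7, 0.3] and y_new is 0.7
```

Because the interpolated samples fill in the space between training points, Mixup is one way to stretch a small or imbalanced dataset, such as one where few points correspond to the event of interest.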
OpenRBC was described as software capable of simulating the red blood cell lipid bilayer and cytoskeleton, modeled with millions of mesoscopic particles. For example, it can identify sickle-shaped cells in people with SCD and monitor how blood cell clusters develop and circulate. It can also measure blood viscosity in silico using multiscale modeling. Thus, it has potential use in testing new therapies for SCD.
From Transcript to Tissue: Synthesis and Interpretation of Data Across Multiple Scales
There are opportunities to combine gene expression studies with physics-based models of disease. Gene ontology studies can provide a transcriptional profile of disease and reduce complex, poorly understood mechanisms to a small set of biological processes. Research on Marfan syndrome shows how physics-based models can also shed light on disease processes. A physics-based model of Marfan syndrome shows the applied loads, changes in pressure and flow, and how these applied loads would affect the arterial wall at the tissue level (the level observed clinically with imaging) in people with Marfan syndrome who have hypertension. Physics-based models can be used for the fluid mechanics and for the wall mechanics, which can show the geometry and microstructural composition. Confidence in these physics-based models is high, but they need to be connected to the available transcriptional and even proteomic information. Thus, combining transcriptomics, proteomics, cell signaling, quantitative imaging, and computational methods can provide unprecedented insight into diseases like Marfan syndrome.
Interpreting Results from Genome-wide Searches—Experiences from TOPMed
The NHLBI Trans-Omics for Precision Medicine (TOPMed) program aims to provide disease treatments tailored to an individual’s unique genes and environment. For example, TOPMed data have been used to search for gene variants associated with red blood cell quantitative traits. In an analysis of WGS data from 62,653 people, the investigators found 14 single variants at 12 genomic loci that were associated with seven red blood cell quantitative traits. Almost all WGS P-values rely on large-sample approximations. In many cases, these approximations are good enough, but not in WGS, where signals are very small and the accuracy of the approximations matters. Therefore, aggressive filtration is important. Saddlepoint approximations can eliminate spurious signals by improving the accuracy of P-value approximations.
Importance of Diversity in STEM
National Science Board member Dr. Victor McCrary talked about the power of tapping into America's diversity to help address many of the challenges in STEM training. The Board serves as an independent body of advisors to both the President and Congress on policy matters related to science and engineering and to science and engineering education. Vision 2030 lays out what the National Science Board believes the United States must do to remain the world innovation leader in 2030. The Vision Roadmap lays out actions needed to achieve the vision and has four goals:
- Foster a global science and engineering community.
- Be competitive and attractive domestically.
- Deliver benefits from research.
- Develop STEM talent for America.
NSF’s 2021 report on Women, Minorities, and People with Disabilities in Science and Engineering shows that scientists and engineers with one or more disabilities have an unemployment rate greater than that of the U.S. labor force. In addition, Black science and engineering doctoral recipients report higher rates of paying for their graduate education through personal funds, including loans, than other racial groups, which sets up inequitable long-term financial burdens.
CONCLUSION
NHLBI hopes to broaden the pool of investigators with access to its data science resources; increase the diversity of these investigators; and bring people in data science, social science, and biological systems together to form teams that can address research questions related to HLBS priorities. Data science has the potential to enable the NHLBI research community to answer new questions in ways that were previously not possible and to create meaningful impact for the health of the Nation.
Workshop Steering Committee
Erin Iturriaga, D.N.P., M.S.N., R.N., Division of Cardiovascular Sciences, NHLBI
Jonathan Kaltman, M.D., Division of Cardiovascular Sciences, NHLBI
Asif Rizwan, Ph.D., Division of Blood Diseases and Resources, NHLBI
Colin Wu, Ph.D., Division of Intramural Research, NHLBI
Speakers
(view Program Book pdf for Speaker Bios)
Seung Hoan Choi, Ph.D., Broad Institute of MIT and Harvard
Michelle Daya, Ph.D., University of Colorado Denver
Gary Gibbons, M.D., NHLBI
David Goff, M.D., Ph.D., NHLBI
Li-Yueh Hsu, D.Sc., Clinical Center, NIH
Jay Humphrey, Ph.D., Yale University
Erin Iturriaga, D.N.P., M.S.N., R.N., NHLBI
Jonathan Kaltman, M.D., NHLBI
George Em Karniadakis, Ph.D., Brown University
Sanmi Koyejo, Ph.D., University of Illinois at Urbana-Champaign
Sweta Ladwa, M.P.H., P.M.P., NHLBI
Alison Leaf, Ph.D., SevenBridges
Victor McCrary, Jr., Ph.D., University of the District of Columbia; National Science Board
Wendy Nilsen, Ph.D., NSF
Dandi Qiao, Ph.D., Brigham and Women’s Hospital
Danda Rawat, Ph.D., Howard University
Ken Rice, Ph.D., University of Washington
Asif Rizwan, Ph.D., NHLBI
Dave Roberson, SevenBridges
Beth Sheets, M.S., UC Santa Cruz Genomics Institute
Xin Tian, Ph.D., NHLBI
Amber Voght, Renaissance Computing Institute at UNC
Fayuan Wen, Ph.D., Howard University
Colin Wu, Ph.D., NHLBI
Workshop Recordings
Day 1:
Day 2: