Working Group on Genome Wide Association in NHLBI Cohorts
September 12, 2005
Translating the growing understanding of the genetics of complex heart, lung, blood, and sleep disorders into effective
diagnostic tools and treatments is a high priority for the National Heart, Lung, and Blood Institute (NHLBI). Such
translation must begin with a strong knowledge base, to provide a firm foundation for discovery research and for generation
and testing of new hypotheses that can contribute directly to development of diagnostic tools and therapeutics. The
Institute has a long tradition of supporting population-based longitudinal cohort studies, in which extensively phenotyped
individuals sampled from the general population are followed at length for the development of disease. Addition of
extensive genotyping information to these cohorts would provide unparalleled resources for assessing genetic contributions
to heart, lung, blood, and sleep disorders. Many of NHLBI’s current cohort studies appear suitable for typing the hundreds
of thousands of single nucleotide polymorphisms (SNPs) needed for genome-wide association (GWA) studies. However, these
studies include a variety of disease endpoints, phenotypic definitions, and environmental measures; thus, there are
challenges in combining the non-genetic components of the databases.
To address the need for translation of genomic discoveries to heart, lung, blood, and sleep disorders research and
clinical care, NHLBI recognizes the importance of developing a portfolio of mechanisms for investigators to conduct GWA
studies (and possibly selective or complete resequencing) in NHLBI cohort and clinical studies. An expert working group
was convened on September 12, 2005, to address these important goals.
The working group was asked to recommend policies and approaches for a transparent process for GWA study selection,
determination of genotyping and database platforms, and data sharing and analysis. Specifically, the group was asked to address:
- objectives of GWA studies and DNA resequencing in population-based cohorts and clinical studies
- criteria for selecting cohorts and studies for GWA analysis
- technology platforms – assessment, quality control, reproducibility and validation
- bioinformatic needs and data management
- statistical analysis
- data sharing, access, consent, confidentiality, reporting
- sample acquisition and storage.
NHLBI Director Dr. Elizabeth Nabel opened the conference by recognizing the importance of completing the genotyping
quickly and efficiently so that the data can get out to the scientific community as soon as possible for use in discovery
research. Large-scale projects worth considering should:
- be science driven, but not necessarily strictly hypothesis-limited
- be ripe for a high throughput approach
- substitute a comprehensive cost-effective (low unit cost) strategy for an inefficient cottage-industry approach
- empower the broad scientific community to carry out research more efficiently
- facilitate entirely new approaches to research problems
- be sufficiently compelling scientifically that scientists will want to work on them
- encourage interdisciplinary approaches
- emphasize technology development
- inspire support as a “signature” initiative for human health
- have ambitious but achievable milestones, to which investigators are held accountable
- have a rigorous plan for quality assurance (QA) and quality control (QC)
- provide early access to data and materials for the entire scientific community
- attract additional partners from both public and private sectors, both nationally and internationally
- be alert to social consequences, and be prepared to address them
- be managed creatively but firmly, with a rigorous scientific advisory process and ethics oversight
- have a plan for phase out.
Each major issue discussed by the working group and its recommendations are summarized below.
Criteria for Selecting Cohorts to Be Genotyped
A major challenge to GWA studies using existing cohorts is how to combine genotypic and phenotypic data in these
multiple datasets; that is, how to mine the observational data and localize disease genes relative to a set of markers,
relying on non-random associations between alleles (linkage disequilibrium). Thus, the two worlds of molecular genetics,
which is producing reliable and comprehensive genetic markers for high-throughput genome-wide genotyping, and population-based
and clinical epidemiology, which is expert in defining and measuring phenotypes and disease outcomes, must be wedded.
In addition, major issues related to comparability of phenotyping (and perhaps, to a lesser degree, genotyping) across
studies must be addressed. Key subgroups of phenotypes, such as mortality, morbidity, subclinical disease, and risk
factors, may be defined and assessed with varying degrees of rigor. Simply combining endpoints across studies will not work.
In selecting a cohort or cohorts, the following practical questions must be addressed:
- Should the entire cohort be used or just informative subsets?
- Should longitudinal data be used; are phenotypic change data and incident events important?
- Should there be focus on more efficiently informative subsets, such as case-cohort or case-control subgroups or
- Are appropriate biospecimens available from the cohort?
- Have the informed consent issues been adequately addressed?
- Can collaboration be readily accomplished, across multiple steering committees or in the presence of ongoing
genetic marker studies within these cohorts?
- Will identification of clinically relevant findings necessitate informing participants?
- Are the data and samples accessible to other researchers?
- Should there be a focus on related or unrelated individuals?
In addition, several scientific and clinical parameters must be assessed, not the least of which is deciding whether the
strategy should focus on achieving a few key successes early on (that is, by focusing on a few genes or a few
clearly-defined phenotypes) or engaging in a broad frontal attack aimed at discovering associations among numerous genes and
phenotypes. Similarly, is the goal to detect and predict individual risks or to estimate or detect effects and define
mechanisms? A key consideration is the ability to replicate associations within and among cohorts. This is especially
critical because there is an underlying concern that many of the initial reports of gene-disease associations in the current
literature appear to be spurious or non-reproducible.
Study design options for whole genome association and replication include: a) type the whole cohort; b) utilize nested
case-control studies within a cohort; c) use a case-cohort design, with a random sample from the cohort serving as a single
control group for multiple case groups; d) utilize longitudinal phenotypes; and e) include family studies or related
individuals, particularly trios. Advantages of family studies include the greater ability to control for population
stratification, the greater ease of QA/QC for genotyping, and the more limited environmental variation, which increases the
chance of unmasking a genetic effect. Family studies also provide an opportunity to assess linkage and association in
family-based association testing. Phenotypic information and age distributions tend to be more limited in families,
however, and their generalizability may be questionable. Several ongoing NHLBI cohort studies have family studies linked
to them, including the Strong Heart Study, the Multi-Ethnic Study of Atherosclerosis (MESA), and the Insulin Resistance
Atherosclerosis Study (IRAS), while the three cohorts of the Framingham study provide a unique three-generation design with
many phenotypes measured at the same ages in all three groups.
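As an illustration of design option (c), the sketch below selects participants for genotyping under a case-cohort design: a random subcohort is drawn without regard to case status, and all cases for each endpoint are added to it. The function name, the endpoint labels, and the 10% sampling fraction are hypothetical choices made for the example, not working group specifications.

```python
import random


def case_cohort_sample(cohort_ids, cases_by_endpoint, subcohort_fraction, seed=0):
    """Return (subcohort, to_genotype) for a case-cohort design.

    cohort_ids: iterable of all participant IDs in the cohort
    cases_by_endpoint: dict mapping an endpoint name to a set of case IDs
    subcohort_fraction: fraction of the full cohort to draw at random
    """
    rng = random.Random(seed)
    ids = sorted(cohort_ids)
    n_sub = round(len(ids) * subcohort_fraction)
    # The subcohort is sampled regardless of case status, so the same
    # group can serve as the comparison group for every endpoint.
    subcohort = set(rng.sample(ids, n_sub))
    # Every case for every endpoint is genotyped, in or out of the subcohort.
    all_cases = set().union(*cases_by_endpoint.values())
    return subcohort, subcohort | all_cases


# Hypothetical cohort of 1,000 participants with two endpoints.
cases = {"stroke": {3, 17, 256}, "mi": {17, 42, 900}}
subcohort, to_genotype = case_cohort_sample(range(1, 1001), cases, 0.10)
print(len(subcohort), len(to_genotype))
```

Because one subcohort is shared across endpoints, adding a new case group requires genotyping only the new cases, which is the efficiency argument for this design over separate case-control sets.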
One promising approach starts with a subset that is extensively genotyped to identify high-priority candidate SNPs or
regions, followed by narrower genotyping on a larger subgroup, the full remaining cohort, or other cohorts. This staged
approach is likely to provide cleaner and more reliable answers, particularly since no single cohort is likely to be
sufficient in size or diversity to provide definitive answers. Each cohort, in effect, is its own population subset and
replication or extension of findings outside that cohort is critical.
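The staged approach can be sketched as a simple two-step filter. This is a minimal illustration only: the thresholds, the Bonferroni correction in the second stage, and the SNP identifiers are assumptions made for the example, and other analysis strategies (such as a joint analysis of both stages) are possible.

```python
def two_stage_hits(stage1_p, stage2_p, stage1_alpha=1e-3, family_alpha=0.05):
    """Return (candidates, replicated) from per-SNP p-values in two stages.

    stage1_p: dict of SNP -> p-value from the densely genotyped subset
    stage2_p: dict of SNP -> p-value from the follow-up group or cohort
    """
    # Stage 1: a liberal threshold nominates candidate SNPs for follow-up.
    candidates = [snp for snp, p in stage1_p.items() if p < stage1_alpha]
    if not candidates:
        return [], []
    # Stage 2: correct only for the number of SNPs carried forward.
    stage2_alpha = family_alpha / len(candidates)
    replicated = [snp for snp in candidates
                  if stage2_p.get(snp, 1.0) < stage2_alpha]
    return candidates, replicated


# Hypothetical p-values; the SNP labels are placeholders.
stage1 = {"snp_a": 1e-5, "snp_b": 0.2, "snp_c": 5e-4}
stage2 = {"snp_a": 0.001, "snp_c": 0.3}
print(two_stage_hits(stage1, stage2))
```

A SNP that was not typed in the second stage is treated as not replicated, which keeps the filter conservative.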
It will also be important to determine whether the aim is to detect a genetic effect that may be limited to a small
subgroup or to study a putative effect in the population at large. GWA may be useful both for detecting effects and for
replicating findings from candidate gene or comparative genomic studies. Although early successes will be important, the
approach to finding them will be very different from a long-range strategic approach that is more likely to be needed given
the complexity of common heart, lung, blood, and sleep disorders. Both approaches will be important.
Criteria for cohort selection should include appropriateness and breadth of phenotypes, stored samples, breadth of
informed consent, public access, and cohort composition in terms of race/ethnicity, age, sex, and family structure.
Phenotype selection should consider the phenotype’s potential public health impact, such as relative contribution to
overall risk of disease (on a relative risk or attributable risk basis) or importance in intermediate pathways, as well as
the availability of a comparably-phenotyped second group for replication. Probably no single cohort will have the diversity
of race/ethnicity and phenotypic characterization needed for a comprehensive assessment. In any case, the phenotypes
selected should be clearly defined, reliably measured, and of potential public health importance. Their relationship to
genetic effects, in terms of pathways and mechanisms, should be well understood. To improve the chances of an early
success, a focus on phenotypes that are highly heritable, of early onset, or that involve an atypical risk profile may be
best, but such an approach may limit the potential importance in a general population or public health context.
Quantitative phenotypes measured with a high degree of precision can be dichotomized into extremes to search for Mendelian
subforms, a strategy that has proven successful in the past, though again findings may be more likely to be limited in
It will be important to focus not just on clinical outcomes and subclinical disease, but also on risk factors, choosing
phenotypes that are closer to the genetic level (such as HDL rather than myocardial infarction). In addition,
common traits that influence disease burden should be selected, so that the study can answer more than one question, for
example by elucidating intermediate traits.
Cohorts of persons with monogenic diseases may have advantages in assuring early success of the methodologies.
Use of the samples from the prospectively studied Cooperative Study of Sickle Cell Disease has already illustrated a
proof of principle for these approaches. Sebastiani and colleagues (Nat. Genet. 2005; 37:435-440) started with 235 SNPs
in 80 candidate genes and used a Bayesian network approach to ultimately identify 31 SNPs in 12 genes that were associated
with the occurrence of stroke. Among these were genes that are good candidates for stroke in sickle cell disease,
including genes in the TGF-beta pathway and fetal hemoglobin.
Finally, it will be important to recognize the impact that the magnitude of the anticipated hazard ratio will have on
the ability to detect an effect. Smaller studies should suffice for genes with hazard ratios of 2 to 3, but their
population-wide importance may be limited. Early successes may be more likely for genes with higher hazard ratios, but they
will be less common and potentially less important in the long term. Finding a genetic variant that contributes to a
disease or trait, even if that variant is rare, may be useful in identifying new therapeutic targets.
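The dependence of study size on the anticipated hazard ratio can be made concrete with a rough calculation. The sketch below uses Schoenfeld's approximation for the number of events needed in a proportional hazards analysis of a binary covariate, treating carrier status as that covariate. The Bonferroni correction for 500,000 tests, the 80% power target, and the carrier fractions are assumptions chosen for illustration, not recommendations of the working group.

```python
from math import ceil, log
from statistics import NormalDist


def events_needed(hazard_ratio, carrier_fraction, n_tests=500_000,
                  family_alpha=0.05, power=0.80):
    """Approximate number of events needed to detect a given hazard ratio.

    Schoenfeld approximation for a binary covariate:
        d = (z_{1 - alpha/2} + z_{power})^2 / (p * (1 - p) * ln(HR)^2)
    where p is the fraction of participants carrying the variant.
    """
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    if not 0 < carrier_fraction < 1:
        raise ValueError("carrier_fraction must be strictly between 0 and 1")
    # Assumed Bonferroni correction for the number of SNPs tested.
    alpha = family_alpha / n_tests
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    p = carrier_fraction
    return ceil((z_alpha + z_power) ** 2 / (p * (1 - p) * log(hazard_ratio) ** 2))


# Compare a large effect with a modest one at the same carrier fraction.
for hr in (3.0, 2.0, 1.3):
    print(hr, events_needed(hr, carrier_fraction=0.2))
```

The required number of events grows with the inverse square of the log hazard ratio, so modest effects demand far more events than effects in the range of 2 to 3.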
- Family studies and related individuals within a cohort can be useful as either a first or back-up step.
- Avoid selecting a single phenotype given richness of phenotypic information in existing NHLBI cohorts, but
consider focus on phenotypes that have a high likelihood of success and are clinically relevant.
- Enriching case groups with particular characteristics or exposures may help to promote early successes but
these should be clinically relevant.
- Use a case-cohort design where appropriate, rather than multiple case-control groups.
- Select cohorts with clearly defined phenotypes and adequate measures of environmental exposures and other
- Select cohorts that present opportunities for replication studies.
- Consider the study of monogenic diseases with well-characterized phenotypes.
- Consider the potential for assessing pharmacogenetic responses, as genetic variation in drug response has
potential clinical importance.
Informatics Needs/Data Management
The goal of any large-scale effort is to store, protect, and share data. The key questions are: who will store the data
and how will they be made available? Other considerations include the ability to integrate data with analytic tools and
provide access to analytic results for further integration with other results. Query capabilities will be critical. Open-source
software should be a goal. Because analytic tools will be constantly evolving, there should be a capability for continual updates
of querying software.
The advantage to typing the entire genome is that it allows for the identification of specific pathways and linkage with
proteomic datasets. The advantage of a centralized databank or portal system is that investigators can stay current on
other studies of specific genes or chromosomal regions. A centralized system allows investigators to find out about ongoing
studies and phenotypes, provides the ability to access data and integrate capabilities, and could result in the development
of improved genome-wide assessment tools. It would also build synergy and facilitate better design and interpretation of
these studies. However, any centralized system has to perform in a way that retains the strengths of individual studies
rather than devolving to the lowest common denominator.
Differences in patient descriptions and data values across cohorts impede the potential for conducting meta-analyses
and aggregating results. Attention to how data are represented, ideally using formats and vocabularies drawn from national
standards such as the National Health Information Network, will facilitate the shared goal of leveraging knowledge and
In selecting cohorts for GWA studies it will be important to assess whether candidate studies are managing and curating
their phenotypic, clinical, and environmental data in an ongoing fashion. Studies proposing to utilize GWA data should
define a limited dataset, be hypothesis-guided, and specify the primary data points. Studies that cross cohorts will
require selection of data that are comparable.
- Create a centralized repository of high quality datasets and facilitate distributed analysis and rapid access to the
- Rapid replication of findings will require uploading of initial results early on.
- Select cohorts based on the quality of available data and documentation.
- Establishing and maintaining a data repository may require different skills from those required for analysis; making
the data widely available may be the best way to ensure effective analysis.
Statistical Analysis Issues
One of the challenges facing GWA studies is the need for better approaches for handling case-control/case-cohort data,
even in terms of simple questions such as whether analysis should focus on alleles or genotypes. This is complicated by
pleiotropy and the collection of data on continuous traits. Another issue to be resolved is how best to test for
Hardy-Weinberg equilibrium (HWE) in large datasets, and what to do with genotypes that violate HWE. Although departures from HWE
may represent genotyping errors, this is less likely to be true if there is an excess of heterozygotes, which may in fact
suggest an allele of potential clinical importance. Departures from HWE can arise from a variety of causes, some of which
we do not understand. Population structure can inflate the differences among samples; thus, the design will have to combine
data over multiple SNPs to eliminate the noise (some of which is statistical and some of which is biological).
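For reference, one common way to screen a SNP for departure from HWE is a one-degree-of-freedom chi-square goodness-of-fit test on the three genotype counts. The sketch below is a minimal version of that test and also reports whether heterozygotes are in excess or deficit, since the direction of the departure bears on its interpretation. The function name and the example counts are illustrative, and an exact test may be preferable when an allele is rare.

```python
from math import erfc, sqrt


def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.

    Returns (chi2, p_value, het_excess), where het_excess is the observed
    minus expected heterozygote count (positive means excess).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, n_ab - expected[1]


# Hypothetical counts showing a deficit of heterozygotes.
chi2, p_value, het_excess = hwe_test(40, 20, 40)
print(round(chi2, 2), p_value, het_excess)
```

In a genome-wide setting the p-value threshold would need to account for the number of SNPs tested, and flagged SNPs might be set aside for review rather than discarded automatically.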
- Put the data in a central repository for many to analyze; consider creating a limited data set common to
- Create reference panels of representative population samples that will help in validating initial tests of
- Account for race/ethnic differences and potential population stratification, and determine how well population
subgroups are represented in the current databases.
- Consider using known positive associations for benchmarking, QC, and comparison across cohorts.
- Don’t mandate the questions that should be addressed.
Data Sharing, Access, Consent, Confidentiality, Reporting
Data Sharing: Mechanisms for limited access data sharing (LADS) already are in place for the NHLBI cohorts and
have been implemented for ancillary studies through NHLBI limited access data sets policies. However, challenges remain in
achieving greater efficiency in data sharing, particularly across different cohorts. Data formats and vocabularies vary by
study, the tools to support open access for effective sharing of data can be improved, and technical support will be
required. In particular, criteria for defining phenotypic, behavioral, and environmental data, as well as their
documentation, vary by study. This requires a significant effort in data integration, the development of unified
vocabularies and metadata, and agile modeling of study protocols and of population attributes.
Further, subject matter experts and technical support will be needed for mapping and integrating data, for navigating
study protocols and databases, and for enabling researchers of diverse backgrounds to make effective use of this resource.
At a more modest level of effort, coordination with each of the selected NHLBI cohorts will be required, primarily the
respective coordinating centers.
Access: NHLBI policies for access to biospecimens and data are being implemented through distribution agreements
modeled after established prototypes. NHLBI cohorts appear to have incorporated these policies in their distribution of
data and specimens internal to the study and in providing access to such resources to ancillary studies. Material and data
distribution agreements (MDA/DDA agreements) continue to be updated and are not likely to be uniform across NHLBI cohorts.
These NHLBI policies seem appropriate and sufficient as a basis for a data access mechanism for non-profit and
for-profit agencies, as they have been implemented through agreements with numerous users of NHLBI data. Standards are
now established for data confidentiality and for periodic reporting and sharing of genotyping data generated by outside
users. Timing of data sharing may also be an issue; GWA genotyping data may be made available very quickly but the LADS
process for phenotypic and other data is much slower. Consideration should be given to making early GWA genotyping
contingent on early release of the full non-genetic data set.
Informed Consent Issues: Standards of informed consent practices have evolved over the course of most current
NHLBI cohorts. Updating and repeating the consent process has brought most cohorts to contemporaneous standards. However,
cohort members lost to follow-up and those who experienced fatal or disabling events tend to have older versions of
informed consent. In addition, heterogeneity in IRB practices has introduced diversity in the specifics of consent within
cohorts. For example, some IRBs require so-called “tiered consent,” a consent process that allows subjects to specify
allowable uses of their specimens or medical information (such as use only for the current project, or only for specified
phenotypes or categories of morbid conditions). Such tiered consent forms increase the specificity of participants’
consent but this information must be captured and automated to track allowable uses of specimens and materials over time.
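A minimal sketch of how tiered consent might be captured and checked automatically is shown below. The record layout, the consent version labels, and the use categories are hypothetical; an actual system would follow each cohort's consent forms and IRB requirements.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConsentRecord:
    """Latest consent on file for one participant (hypothetical layout)."""
    participant_id: str
    consent_version: str
    allowed_uses: frozenset = field(default_factory=frozenset)


def eligible_participants(records, proposed_use):
    """Return IDs of participants whose consent covers the proposed use."""
    return [r.participant_id for r in records if proposed_use in r.allowed_uses]


# Hypothetical records with tiered permissions.
records = [
    ConsentRecord("P001", "v3", frozenset({"current_project", "cardiovascular"})),
    ConsentRecord("P002", "v1", frozenset({"current_project"})),
    ConsentRecord("P003", "v3", frozenset({"current_project", "cardiovascular",
                                           "any_genetic_research"})),
]
print(eligible_participants(records, "cardiovascular"))
```

Storing the consent version alongside the permitted uses makes it possible to re-evaluate eligibility when a participant re-consents or when a new use category is proposed.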
Confidentiality: Considerable experience exists in safeguarding participant confidentiality in the context of LADS
preparation. Distribution agreements and withholding or categorical collapsing of potentially identifying information are
established procedures for LADS preparation in NHLBI cohort studies, as implemented through their coordinating centers.
These confidentiality procedures are centrally coordinated in the Division of Epidemiology and Clinical Applications and are currently applied to many of the NHLBI cohorts under consideration for GWA studies. These procedures can serve as a
prototype for data sharing in NHLBI-supported GWA studies.
Reporting of Results: Typically, informed consent specifies that results from genotyping will not be reported to
participants or that only results of established value for medical diagnosis or treatment will be reported. Some
heterogeneity by cohort and study site exists. Expectations for what constitutes results that require notification are
likely to evolve; it would be reasonable to anticipate the need for a mechanism to oversee and manage this process.
The investigators of the respective source cohort have the responsibility to report to participants any finding deemed
to require notification (whether the finding originates from an NHLBI sponsored genotyping center or from material shared
with other investigators). Local IRBs differ in their specifications for the procedures to be followed for this purpose.
An NIH-wide policy on results reporting is needed.
- Establish procedures to assemble, map, and reconcile demographic and phenotypic data for the selected cohorts.
- Develop unified data vocabularies and metadata, linked to measurement protocol trees.
- Develop a library of data comparability assessments.
- Establish a library of analytic tools to facilitate classification, transformation, reduction and retrieval of
the pooled data.
- Develop and maintain user-friendly data access and security procedures.
- Adapt current versions of MDA/DDA agreements for access to specimens and data in GWA studies of NHLBI cohorts.
- Improve the inventory of sample informed consent forms for NHLBI cohorts selected for GWA studies and track
- Map items that affect sharing of biospecimens and data according to each cohort member’s latest consent version.
- Apply the Limited Access Data Sets (LADS) policy.
- Identify the requirements for reporting research results to subjects, particularly with regard to what should be
specified during the consent process.
- Existing LADS confidentiality procedures can serve as the basis for an equivalent approach for the GWA studies
of NHLBI cohorts.
Reporting of Results:
- Identify requirements for reporting results as specified in the informed consent version applicable to each cohort.
- If the obligation to report results is not uniformly waived, identify the level at which proposed candidate
genes/integration of analyses will be reviewed in light of reporting guidelines.
- If the previous step is not practical, identify a mechanism to review genetic variants in the database for
compliance with reporting guidelines.
- Establish a mechanism to channel results deemed to require notification to the respective cohort study investigators.
- Encourage the development of an NIH-wide policy on reporting of results to study participants.
Approaches to Genotyping, Assessment of Platforms, Quality Control
The amount of data being generated through ongoing projects, such as HapMap, is overwhelming. We can now rapidly assay
hundreds of thousands of SNPs. We also can measure the performance of laboratories and technologies, and costs have
dropped. These trends combine to promise that the technology will continue to get better. This is an appropriate time for
NHLBI to invest in GWA studies.
In making the investment it will be important to ensure that investigators have access to multiple genotyping platforms
for conducting whole genome scans. Each company approaches whole genome scans in a different way; taken together, these
provide synergy. These companies typically offer pre-determined sets of SNPs that will be on the shelf and available to
investigators, which in the aggregate will reduce the noise problem. However, these sets are not optimal because they do
not always include SNPs from diverse populations.
Upfront Issues (pre-genotyping): Consider the requirements for sample preparation, such as: How much DNA is
needed? How sensitive is the genotyping platform to sample quality? Is it successfully used when whole genome amplified?
What steps are in place to judge DNA adequacy? Does the repository have adequate DNA handling and tracking capabilities?
Processing Considerations: What checks are done to track samples through the lab and to promote QA and QC, such as call
rates and checking for internal controls? Is there a rapid requeuing process for samples that need to be re-run? Is there
monitoring of DNA plates throughout the process and is there assessment of batch effects?
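Call rates are among the simplest of these checks. The sketch below computes the call rate per sample and per SNP from a small genotype table in which a missing call is recorded as None. The data layout and the sample labels are assumptions made for the example; production pipelines typically work from platform-specific output files.

```python
def call_rates(genotypes):
    """Compute call rates from a dict of sample ID -> list of genotype calls.

    Every sample must list calls for the same SNPs in the same order;
    a failed call is represented by None.
    """
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    if any(len(genotypes[s]) != n_snps for s in samples):
        raise ValueError("all samples must have the same number of SNPs")
    # Fraction of SNPs successfully called for each sample.
    per_sample = {s: sum(g is not None for g in genotypes[s]) / n_snps
                  for s in samples}
    # Fraction of samples successfully called at each SNP position.
    per_snp = [sum(genotypes[s][j] is not None for s in samples) / len(samples)
               for j in range(n_snps)]
    return per_sample, per_snp


# Hypothetical calls for three samples at four SNPs.
calls = {
    "S1": ["AA", "AG", None, "GG"],
    "S2": ["AA", None, None, "AG"],
    "S3": ["AG", "AG", "TT", "GG"],
}
per_sample, per_snp = call_rates(calls)
print(per_sample, per_snp)
```

Samples or SNPs falling below a laboratory-defined threshold could then be queued for re-running, and the same per-SNP calculation run separately in cases and controls would reveal differential call rates.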
Downstream (post-genotyping): What are the requirements for data curation and validation, such as QC checks,
Hardy-Weinberg checks, differential call rates, stratification of samples, checking for artifacts, developing fine mapping
panels, correctly identifying strandedness? Are there facilities for re-genotyping candidate hits with different
- Now is the time to undertake whole genome scans. The HapMap has provided experience that can be applied to GWA
studies, but issues surrounding ensuring the quantity and quality of the DNA have to be addressed.
- Validation needs to be possible across genotyping platforms.
- Picking the optimal SNP set is important but NHLBI should not wait for the perfect set to become available.
- Do not require every center to have expertise in DNA preparation.
- Insist that genotype facilities have rigid QC in place.
- Minimum DNA amounts are probably in the 500 nanogram to 1 microgram range with whole genome amplification, but
stated DNA concentrations are often unreliable so it may be best to request 1 microgram.
- DNA quality is a consideration but probably samples degraded to fragments of 2 kb can still be adequately
Sample Acquisition and Types
Phenotypically and genetically, the science is moving from measuring individual molecules to assessing pathways.
This means that the full complement of protein, lipid, carbohydrate, and nucleic acid data will be needed. In addition,
we will need to be able to stimulate pathways and watch them respond in living cells and tissue.
It will be important to consider the frequency of sample collection. Should there be multiple collections for each
“time point”? How can we minimize analytical and within subject variance? What is the value of multiple time points for
Samples will have to be protected against protease and nuclease inhibition and oxidation so that metabolic pathways
- Serum, plasma, and urine are, and will remain, useful. RNA and DNA are critical but require special preparation
methods. Platelets are important for mtDNA. Living tissue is increasingly important. Although blood cells serve
this need, many other biopsy samples can be obtained, such as adipose tissue, muscle, skin. Whole cells are needed
for telomere analysis.
- Research is needed on biopsy methods (e.g., thin needle), rapid sample preparation, and establishing
pathway-based phenotypes (e.g., effects of sample acquisition, preparation and storage, usefulness of cell lines
for retained phenotypes).
- Research is needed on preparing cells so that metabolic pathways remain intact. This assumes that we know the
pathways that are active in vivo, have assays for these pathways, and can perform ex vivo modifications. We also
need method development for preserving metabolic activities during shipping.
- Repositories need better coordination to make it easier for researchers to access and use samples from multiple
banks that are diverse and representative of the population. This requires commonality and standardization of
methods, development of a comprehensive database of what is available, and improved communication among
- Because the phenotypic work required for GWA studies is complex, training facilities will be needed for technical staff.
NHLBI should embark on large-scale genotyping and resequencing projects. The entire NHLBI community of heart, lung,
blood, and sleep researchers should be invited to participate in the genomic revolution.
NHLBI should use existing cohorts to assemble the best data available for the purposes of GWA studies. These cohort
studies have collected a large amount of phenotypic and exposure data, providing the opportunity to conduct many
experiments, including evaluating gene candidates and positional cloning.
More than one cohort should be included, forcing resolution of data and repository harmonization and centralization
issues. In selecting cohorts, consideration should be given to diversity by age, sex, race, and ethnicity, as well as
availability of related individuals and opportunities for replication.
Well-defined and quantified phenotypes of public health importance that cut across NHLBI’s research areas should be
selected for study. Phenotypes should be selected that permit use of several population samples to obtain cases and
NHLBI should encourage immediate access to data held in a centralized data repository.
Requests for proposals or applications should ask investigators to design a set of experiments that demonstrate how this
approach will lead to the discovery of genes of certain size, effect, and relevance; that is, such experiments should be
hypothesis-guided. Both near-term and long-term successes are needed.
A mixture of study designs should be supported, including studies of families, trios, and unrelated individuals; studies
of cases and controls as well as full cohorts; all with the goal of finding associations. There may be a need for pilot
It will be important to get primary information about the cohorts out to the community quickly, such as numbers of study
participants with certain phenotypes, so investigators can reasonably prepare applications. Investigators should be able to
query rapidly for numbers of participants with a given event.
Do not restrict studies to NHLBI-supported or known cohorts; allow wide consideration of the best available cohorts that meet the selection criteria.
Working Group Members
Eric Boerwinkle, Ph.D.
Gregory L. Burke, M.D., M.S.
Aravinda Chakravarti, Ph.D.
Christopher G. Chute, M.D., Dr.P.H.
Mark J. Daly, Ph.D.
Seth E. Dobrin, Ph.D.
Kelly A. Frazer, Ph.D.
Stacey Gabriel, Ph.D.
Gary Gibbons, M.D.
Gerardo Heiss, M.D.
Eduardo Marbán, M.D.
Betty S. Pace, M.D.
Susan Redline, M.D., M.P.H.
Christine Seidman, M.D.
Belgaum S. Thyagaraja, Ph.D.
Russell P. Tracy, Ph.D.
Bruce S. Weir, Ph.D.
Scott T. Weiss, M.D., M.S.