September 12, 2005
TABLE OF CONTENTS
- Criteria for Selecting Cohorts to be Genotyped
- Informatics Needs/Data Management
- Statistical Analysis Issues
- Data Sharing, Access, Consent, Confidentiality, Reporting
- Approaches to Genotyping, Assessment of Platforms, Quality Control
- Sample Acquisition and Types
- Overall Recommendations
- Working Group Members
Translating the growing understanding of the genetics of complex heart, lung, blood, and sleep disorders into effective diagnostic tools and treatments is a high priority for the National Heart, Lung, and Blood Institute (NHLBI). Such translation must begin with a strong knowledge base, to provide a firm foundation for discovery research and for generation and testing of new hypotheses that can contribute directly to development of diagnostic tools and therapeutics. The Institute has a long tradition of supporting population-based longitudinal cohort studies, in which extensively phenotyped individuals sampled from the general population are followed at length for the development of disease. Addition of extensive genotyping information to these cohorts would provide unparalleled resources for assessing genetic contributions to heart, lung, blood, and sleep disorders. Many of NHLBI's current cohort studies appear suitable for typing the hundreds of thousands of single nucleotide polymorphisms (SNPs) needed for genome-wide association (GWA) studies. However, these studies include a variety of disease endpoints, phenotypic definitions, and environmental measures; thus, there are challenges in combining the non-genetic components of the databases.
To address the need for translation of genomic discoveries to heart, lung, blood, and sleep disorders research and clinical care, NHLBI recognizes the importance of developing a portfolio of mechanisms for investigators to conduct GWA studies (and possibly selective or complete resequencing) in NHLBI cohort and clinical studies. An expert working group was convened on September 12, 2005, to address these important goals.
The working group was asked to recommend policies and approaches for a transparent process for GWA study selection, determination of genotyping and database platforms, and data sharing and analysis. Specifically, the group was asked to address:
- objectives of GWA studies and DNA resequencing in population-based cohorts and clinical studies
- criteria for selecting cohorts and studies for GWA analysis
- technology platforms: assessment, quality control, reproducibility and validation
- bioinformatic needs and data management
- statistical analysis
- data sharing, access, consent, confidentiality, reporting
- sample acquisition and storage.
NHLBI Director Dr. Elizabeth Nabel opened the conference by recognizing the importance of completing the genotyping quickly and efficiently so that the data can get out to the scientific community as soon as possible for use in discovery research. Large-scale projects worth considering should:
- be science driven, but not necessarily strictly hypothesis-limited
- be ripe for a high throughput approach
- substitute a comprehensive cost-effective (low unit cost) strategy for an inefficient cottage-industry approach
- empower the broad scientific community to carry out research more efficiently
- facilitate entirely new approaches to research problems
- be sufficiently compelling scientifically that scientists will want to work on them
- encourage interdisciplinary approaches
- emphasize technology development
- inspire support as a "signature" initiative for human health
- have ambitious but achievable milestones, to which investigators are held accountable
- have a rigorous plan for quality assurance (QA) and quality control (QC)
- provide early access to data and materials for the entire scientific community
- attract additional partners from both public and private sectors, both nationally and internationally
- be alert to social consequences, and be prepared to address them
- be managed creatively but firmly, with a rigorous scientific advisory process and ethics oversight
- have a plan for phase out.
Each major issue discussed by the working group and its recommendations are summarized below.
Criteria for Selecting Cohorts to Be Genotyped
A major challenge to GWA studies using existing cohorts is how to combine genotypic and phenotypic data in these multiple datasets; that is, how to mine the observational data and localize disease genes relative to a set of markers, relying on non-random associations between alleles and linkage disequilibrium. Thus, the two worlds of molecular genetics, which is producing reliable and comprehensive genetic markers for high-throughput genome-wide genotyping, and population-based and clinical epidemiology, which is expert in defining and measuring phenotypes and disease outcomes, must be wedded. In addition, major issues related to comparability of phenotyping (and perhaps, to a lesser degree, genotyping) across studies must be addressed. Key subgroups of phenotypes, such as mortality, morbidity, subclinical disease, and risk factors, may be defined and assessed with varying degrees of rigor. Simply combining endpoints across studies will not work.
In selecting a cohort or cohorts, the following practical questions must be addressed:
- Should the entire cohort be used or just informative subsets?
- Should longitudinal data be used; are phenotypic change data and incident events important?
- Should there be focus on more efficiently informative subsets, such as case-cohort or case-control subgroups or case-only designs?
- Are appropriate biospecimens available from the cohort?
- Have the informed consent issues been adequately addressed?
- Can collaboration be readily accomplished, across multiple steering committees or in the presence of ongoing genetic marker studies within these cohorts?
- Will identification of clinically relevant findings necessitate informing participants?
- Are the data and samples accessible to other researchers?
- Should there be a focus on related or unrelated individuals?
In addition, several scientific and clinical parameters must be assessed, not the least of which is deciding whether the strategy should focus on achieving a few key successes early on (that is, by focusing on a few genes or a few clearly-defined phenotypes) or engaging in a broad frontal attack aimed at discovering associations among numerous genes and phenotypes. Similarly, is the goal to detect and predict individual risks or to estimate or detect effects and define mechanisms? A key consideration is the ability to replicate associations within and among cohorts. This is especially critical because there is an underlying concern that many of the initial reports of gene-disease associations in the current literature appear to be spurious or non-reproducible.
Study design options for whole genome association and replication include: a) type the whole cohort; b) utilize nested case-control studies within a cohort; c) use a case-cohort design, with a random sample from the cohort serving as a single control group for multiple case groups; d) utilize longitudinal phenotypes; and e) include family studies or related individuals, particularly trios. Advantages of family studies include the greater ability to control for population stratification, the greater ease of QA/QC for genotyping, and the more limited environmental variation, which increases the chance of unmasking a genetic effect. Family studies also provide an opportunity to assess linkage and association in family-based association testing. Phenotypic information and age distributions tend to be more limited in families, however, and their generalizability may be questionable. Several ongoing NHLBI cohort studies have family studies linked to them, including the Strong Heart Study, the Multi-Ethnic Study of Atherosclerosis (MESA), and the Insulin Resistance Atherosclerosis Study (IRAS), while the three cohorts of the Framingham study provide a unique three-generation design with many phenotypes measured at the same ages in all three groups.
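As a rough illustration of design option (c), the sketch below draws one random subcohort and reuses it as the comparison group for several case groups. The function name, participant IDs, endpoint names, and the 10% sampling fraction are hypothetical and are not taken from any NHLBI cohort; this is a minimal sketch of the idea, not a recommended sampling protocol.

```python
import random


def case_cohort_sets(cohort_ids, cases_by_endpoint, subcohort_fraction, seed=0):
    """Draw one random subcohort and pair it with every case group.

    cohort_ids: iterable of participant IDs in the full cohort.
    cases_by_endpoint: dict mapping endpoint name -> set of case IDs.
    Returns (subcohort, analysis_sets); analysis_sets maps each endpoint
    to the IDs that would be analysed for that endpoint.
    """
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    ids = sorted(cohort_ids)
    n_sub = round(len(ids) * subcohort_fraction)
    subcohort = set(rng.sample(ids, n_sub))
    analysis_sets = {
        endpoint: subcohort | set(cases)
        for endpoint, cases in cases_by_endpoint.items()
    }
    return subcohort, analysis_sets


cohort = range(1, 1001)                # hypothetical participant IDs
cases = {                              # hypothetical incident cases
    "stroke": {3, 250, 777},
    "heart_failure": {42, 250, 900, 901},
}
subcohort, sets_by_endpoint = case_cohort_sets(cohort, cases, 0.10, seed=1)

# Everyone who would need genotyping: the shared subcohort plus all cases.
to_genotype = set().union(*sets_by_endpoint.values())
```

Because the subcohort is genotyped once and shared, the genotyping burden grows only with the number of additional cases as further endpoints are added, which is the practical appeal of this design over separate case-control sets.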
One promising approach starts with a subset that is extensively genotyped to identify high-priority candidate SNPs or regions, followed by narrower genotyping on a larger subgroup, the full remaining cohort, or other cohorts. This staged approach is likely to provide cleaner and more reliable answers, particularly since no single cohort is likely to be sufficient in size or diversity to provide definitive answers. Each cohort, in effect, is its own population subset and replication or extension of findings outside that cohort is critical.
It will also be important to determine whether the aim is to detect a genetic effect that may be limited to a small subgroup or to study a putative effect in the population at large. GWA may be useful both for detecting effects and for replicating findings from candidate gene or comparative genomic studies. Although early successes will be important, the approach to finding them will be very different from a long-range strategic approach that is more likely to be needed given the complexity of common heart, lung, blood, and sleep disorders. Both approaches will be important.
Criteria for cohort selection should include appropriateness and breadth of phenotypes, stored samples, breadth of informed consent, public access, and cohort composition in terms of race/ethnicity, age, sex, and family structure. Phenotype selection should consider the phenotype's potential public health impact, such as relative contribution to overall risk of disease (on a relative risk or attributable risk basis) or importance in intermediate pathways, as well as the availability of a comparably-phenotyped second group for replication. Probably no single cohort will have the diversity of race/ethnicity and phenotypic characterization needed for a comprehensive assessment. In any case, the phenotypes selected should be clearly defined, reliably measured, and of potential public health importance. Their relationship to genetic effects, in terms of pathways and mechanisms, should be well understood. To improve the chances of an early success, a focus on phenotypes that are highly heritable, of early onset, or that involve an atypical risk profile may be best, but such an approach may limit the potential importance in a general population or public health context. Quantitative phenotypes measured with a high degree of precision can be dichotomized into extremes to search for Mendelian subforms, a strategy that has proven successful in the past, though again findings may be more likely to be limited in population-wide impact.
It will be important to focus not just on clinical outcomes and subclinical disease, but also on risk factors. It will thus be important to choose a phenotype that is closer to the genetic level (such as HDL rather than myocardial infarction). In addition, common traits should be selected that influence disease burden, so that the study can answer more than one question, for example, elucidating intermediate traits.
Cohorts of persons with monogenic diseases may have advantages in assuring early success of the methodologies. Use of the samples from the prospectively studied Cooperative Study of Sickle Cell Disease has already illustrated a proof of principle for these approaches. Sebastiani and colleagues (Nat. Genet. 2005; 37:435-440) started with 235 SNPs in 80 candidate genes and used a Bayesian network approach to ultimately identify 31 SNPs in 12 genes that were associated with the occurrence of stroke. Among these were genes that are good candidates for stroke in sickle cell disease, including genes in the TGF-beta pathway and fetal hemoglobin.
Finally, it will be important to recognize the impact that the magnitude of the anticipated hazard ratio will have on ability to detect an effect. Smaller studies should suffice for genes with hazard ratios of 2 to 3, but their population-wide importance may be limited. Early successes may be more likely for genes with higher hazard ratios, but they will be less common and potentially less important in the long term. Finding a genetic variant that contributes to a disease or trait, even if that variant is rare, may be useful in identifying new therapeutic targets.
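To make the dependence of study size on effect size concrete, the sketch below applies a standard two-proportion sample-size approximation to allele frequencies in cases and controls. Several assumptions here are not in the report: a per-allele odds ratio is used as a stand-in for the hazard ratio, cases and controls are equal in number, the two alleles carried by each person are treated as independent observations, and the significance threshold is 0.05 divided by 500,000 SNPs. The function name and the example allele frequencies are hypothetical.

```python
import math
from statistics import NormalDist

# Assumed genome-wide threshold: 0.05 spread over 500,000 tested SNPs.
GENOME_WIDE_ALPHA = 0.05 / 500_000


def cases_needed(control_allele_freq, odds_ratio, alpha=GENOME_WIDE_ALPHA, power=0.80):
    """Approximate number of cases (with an equal number of controls)
    needed to detect an allele-frequency difference with a two-sided
    two-proportion z-test."""
    p0 = control_allele_freq
    # Allele frequency in cases implied by the per-allele odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    alleles_per_group = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2 / (p1 - p0) ** 2
    # Each participant contributes two alleles.
    return math.ceil(alleles_per_group / 2)


# Compare required sizes across effect sizes for a common allele.
for ratio in (1.3, 2.0, 3.0):
    print(ratio, cases_needed(0.20, ratio))
```

Under these assumptions the required number of cases falls steeply as the effect size rises and climbs again as the allele becomes rarer, which is the trade-off between detectability and population-wide importance described above.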
- Family studies and related individuals within a cohort can be useful as either a first or back-up step.
- Avoid selecting a single phenotype given richness of phenotypic information in existing NHLBI cohorts, but consider focus on phenotypes that have a high likelihood of success and are clinically relevant.
- Enriching case groups with particular characteristics or exposures may help to promote early successes but these should be clinically relevant.
- Use a case-cohort design where appropriate, rather than multiple case-control groups.
- Select cohorts with clearly defined phenotypes and adequate measures of environmental exposures and other covariates.
- Select cohorts that present opportunities for replication studies.
- Consider the study of monogenic diseases with well-characterized phenotypes.
- Consider the potential for assessing pharmacogenetic responses, as genetic variation in drug response has potential clinical importance.
Informatics Needs/Data Management
The goal of any large-scale effort is to store, protect, and share data. The key questions are: who will store the data and how will they be made available? Other considerations include the ability to integrate data with analytic tools and provide access to analytic results for further integration with other results. Query capabilities will be critical. Open-source tools should be a goal, and because analytic tools will be constantly evolving, there should be a capability for continual updates of querying software.
The advantage to typing the entire genome is that it allows for the identification of specific pathways and linkage with proteomic datasets. The advantage of a centralized databank or portal system is that investigators can stay current on other studies of specific genes or chromosomal regions. A centralized system allows investigators to find out about ongoing studies and phenotypes, provides the ability to access data and integrate capabilities, and could result in the development of improved genome-wide assessment tools. It would also build synergy and facilitate better design and interpretation of these studies. However, any centralized system has to perform in a way that retains the strengths of individual studies rather than devolving to the lowest common denominator.
Differences in patient descriptions and data values across cohorts impede the potential for conducting meta-analyses and aggregating results. Attention to how data are represented, ideally using formats and vocabularies drawn from national standards such as the National Health Information Network, will facilitate the shared goal of leveraging knowledge and genomic associations.
In selecting cohorts for GWA studies it will be important to assess whether candidate studies are managing and curating their phenotypic, clinical, and environmental data in an ongoing fashion. Studies proposing to utilize GWA data should define a limited dataset, be hypothesis-guided, and specify the primary data points. Studies that cross cohorts will require selection of data that are comparable.
- Create a centralized repository of high quality datasets and facilitate distributed analysis and rapid access to the data.
- Rapid replication of findings will require uploading of initial results early on.
- Select cohorts based on the quality of available data and documentation.
- Establishing and maintaining a data repository may require different skills from those required for analysis; making the data widely available may be the best way to ensure effective analysis.
Statistical Analysis Issues
One of the challenges facing GWA studies is the need for better approaches for handling case-control/case-cohort data, even in terms of simple questions such as whether analysis should focus on alleles or genotypes. This is complicated by pleiotropy and the collection of data on continuous traits. Another issue to be resolved is how best to test for Hardy-Weinberg equilibrium (HWE) in large datasets, and what to do with genotypes that violate HWE. Although departures from HWE may represent genotyping errors, this is less likely to be true if there is an excess of heterozygotes, and may in fact suggest an allele of potential clinical importance. Departures from HWE can arise from a variety of causes, some of which we do not understand. Population structure can inflate the differences among samples; thus, the design will have to combine data over multiple SNPs to eliminate the noise (some of which is statistical and some of which is biological).
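One simple way to screen a SNP for departure from HWE is the one-degree-of-freedom chi-square goodness-of-fit test on genotype counts, sketched below. The function name and the example counts are hypothetical, and the working group did not endorse a particular test; the chi-square approximation is known to be unreliable when expected counts are small, so this should be read as an illustration only. The sketch also reports whether heterozygotes are in excess, since the text treats that case differently.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi2, p_value, het_excess) for one biallelic SNP,
    given counts of the AA, AB, and BB genotypes."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)    # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    het_excess = n_ab > expected[1]
    return chi2, p_value, het_excess


# Hypothetical counts: one SNP in equilibrium, one with too many heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(10, 80, 10))
```

Returning the direction of the departure alongside the p-value lets a QC pipeline route heterozygote-excess SNPs to review rather than discarding them automatically as genotyping errors.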
- Put the data in a central repository for many to analyze; consider creating a limited data set common to multiple cohorts.
- Create reference panels of representative population samples that will help in validating initial tests of association.
- Account for race/ethnic differences and potential population stratification, and determine how well population subgroups are represented in the current databases.
- Consider using known positive associations for benchmarking, QC, and comparison across cohorts.
- Don't mandate the questions that should be addressed.
Data Sharing, Access, Consent, Confidentiality, Reporting
Data Sharing: Mechanisms for limited access data sharing (LADS) already are in place for the NHLBI cohorts and have been implemented for ancillary studies through NHLBI limited access data sets policies. However, challenges remain in achieving greater efficiency in data sharing, particularly across different cohorts. Data formats and vocabularies vary by study, the tools to support open access for effective sharing of data can be improved, and technical support will be required. In particular, criteria for defining phenotypic, behavioral, and environmental data, as well as their documentation, vary by study. This requires a significant effort in data integration, the development of unified vocabularies and metadata, and agile modeling of study protocols and of population attributes.
Further, subject matter experts and technical support will be needed for mapping and integrating data, for navigating study protocols and databases, and for enabling researchers of diverse backgrounds to make effective use of this resource. At a more modest level of effort, coordination with each of the selected NHLBI cohorts will be required, primarily the respective coordinating centers.
Access: NHLBI policies for access to biospecimens and data are being implemented through distribution agreements modeled after established prototypes. NHLBI cohorts appear to have incorporated these policies in their distribution of data and specimens internal to the study and in providing access to such resources to ancillary studies. Material and data distribution agreements (MDA/DDA agreements) continue to be updated and are not likely to be uniform across NHLBI cohorts.
These NHLBI policies seem appropriate and sufficient as a basis for a data access mechanism for non-profit and for-profit agencies, as they have been implemented through agreements with numerous users of NHLBI data. Standards are now established for data confidentiality and for periodic reporting and sharing of genotyping data generated by outside users. Timing of data sharing may also be an issue; GWA genotyping data may be made available very quickly but the LADS process for phenotypic and other data is much slower. Consideration should be given to making early GWA genotyping contingent on early release of the full non-genetic data set.
Informed Consent Issues: Standards of informed consent practices have evolved over the course of most current NHLBI cohorts. Updating and repeating the consent process has brought most cohorts to contemporaneous standards. However, cohort members lost to follow-up and those who experienced fatal or disabling events tend to have older versions of informed consent. In addition, heterogeneity in IRB practices has introduced diversity in the specifics of consent within cohorts. For example, some IRBs require so-called "tiered consent," a consent process that allows subjects to specify allowable uses of their specimens or medical information (such as use only for the current project, or only for specified phenotypes or categories of morbid conditions). Such tiered consent forms increase the specificity of participants' consent, but this information must be captured and automated to track allowable uses of specimens and materials over time.
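One way such tracking could be automated is to store each signed consent version with its set of allowable uses and to check a proposed use against each participant's most recent record. The sketch below is illustrative only: the record fields, the use categories ("current_project", "cardiovascular", "any_genetic"), and the participant IDs are all hypothetical and do not reflect the consent forms of any NHLBI cohort.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConsentRecord:
    participant_id: str
    version: str
    signed_on: date
    allowed_uses: frozenset            # hypothetical use categories


def latest_consent(records):
    """Keep each participant's most recently signed consent record."""
    latest = {}
    for rec in records:
        prev = latest.get(rec.participant_id)
        if prev is None or rec.signed_on > prev.signed_on:
            latest[rec.participant_id] = rec
    return latest


def eligible_participants(records, proposed_use):
    """IDs whose latest consent explicitly allows the proposed use."""
    return sorted(
        pid
        for pid, rec in latest_consent(records).items()
        if proposed_use in rec.allowed_uses
    )


records = [
    ConsentRecord("P001", "v1", date(1990, 5, 1), frozenset({"current_project"})),
    ConsentRecord("P001", "v3", date(2004, 3, 9),
                  frozenset({"current_project", "any_genetic"})),
    ConsentRecord("P002", "v1", date(1991, 2, 7), frozenset({"current_project"})),
    ConsentRecord("P003", "v2", date(1998, 8, 20),
                  frozenset({"cardiovascular", "any_genetic"})),
]
gwa_eligible = eligible_participants(records, "any_genetic")
```

In this example P002, who signed only an early consent version, is excluded from a broad genetic use, which mirrors the situation of cohort members lost to follow-up who hold older versions of consent.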
Confidentiality: Considerable experience exists in safeguarding participant confidentiality in the context of LADS preparation. Distribution agreements and withholding or categorical collapsing of potentially identifying information are established procedures for LADS preparation in NHLBI cohort studies, as implemented through their coordinating centers. These confidentiality procedures are centrally coordinated in the Division of Epidemiology and Clinical Applications and are currently applied to many of the NHLBI cohorts under consideration for GWA studies. These procedures can serve as a prototype for data sharing in NHLBI-supported GWA studies.
Reporting of Results: Typically, informed consent specifies that results from genotyping will not be reported to participants or that only results of established value for medical diagnosis or treatment will be reported. Some heterogeneity by cohort and study site exists. Expectations for what constitutes results that require notification are likely to evolve; it would be reasonable to anticipate the need for a mechanism to oversee and manage this process.
The investigators of the respective source cohort have the responsibility to report to participants any finding deemed to require notification (whether the finding originates from an NHLBI sponsored genotyping center or from material shared with other investigators). Local IRBs differ in their specifications for the procedures to be followed for this purpose. An NIH-wide policy on results reporting is needed.
- Data Sharing:
- Establish procedures to assemble, map, and reconcile demographic and phenotypic data for the selected cohorts.
- Develop unified data vocabularies and metadata, linked to measurement protocol trees.
- Develop a library of data comparability assessments.
- Establish a library of analytic tools to facilitate classification, transformation, reduction and retrieval of the pooled data.
- Develop and maintain user-friendly data access and security procedures.
- Adapt current versions of MDA/DDA agreements for access to specimens and data in GWA studies of NHLBI cohorts.
- Improve the inventory of sample informed consent forms for NHLBI cohorts selected for GWA studies and track subsequent revisions.
- Map items that affect sharing of biospecimens and data according to each cohort member's latest consent version.
- Apply the Limited Access Data Sets (LADS) policy.
- Identify the requirements for reporting research results to subjects, particularly with regard to what should be specified during the consent process.
- Existing LADS confidentiality procedures can serve as the basis for an equivalent approach for the GWA studies of NHLBI cohorts.
- Reporting of Results:
- Identify requirements for reporting results as specified in the informed consent version applicable to each cohort.
- If the obligation to report results is not uniformly waived, identify the level at which proposed candidate genes/integration of analyses will be reviewed in light of reporting guidelines.
- If the previous step is not practical, identify a mechanism to review genetic variants in the database for compliance with reporting guidelines.
- Establish a mechanism to channel results deemed to require notification to the respective cohort study investigators.
- Encourage the development of an NIH-wide policy on reporting of results to study participants.
Approaches to Genotyping, Assessment of Platforms, Quality Control
The amount of data being generated through ongoing projects, such as HapMap, is overwhelming. We can now rapidly assay hundreds of thousands of SNPs. We also can measure the performance of laboratories and technologies, and costs have dropped. These trends combine to promise that the technology will continue to get better. This is an appropriate time for NHLBI to invest in GWA studies.
In making the investment it will be important to ensure that investigators have access to multiple genotyping platforms for conducting whole genome scans. Each company approaches whole genome scans in a different way; taken together, these provide synergy. These companies typically offer pre-determined sets of SNPs that will be on the shelf and available to investigators, which in the aggregate will reduce the noise problem. However, these sets are not optimal because they do not always include SNPs from diverse populations.
Upfront Issues (pre-genotyping): Consider the requirements for sample preparation, such as: How much DNA is needed? How sensitive is the genotyping platform to sample quality? Is it successfully used when whole genome amplified? What steps are in place to judge DNA adequacy? Does the repository have adequate DNA handling and tracking capabilities?
Processing Considerations: What checks are done to track samples through the lab and to promote QA and QC, such as call rates and checking for internal controls? Is there a rapid requeuing process for samples that need to be re-run? Is there monitoring of DNA plates throughout the process and is there assessment of batch effects?
Downstream (post-genotyping): What are the requirements for data curation and validation, such as QC checks, Hardy-Weinberg checks, differential call rates, stratification of samples, checking for artifacts, developing fine mapping panels, correctly identifying strandedness? Are there facilities for re-genotyping candidate hits with different platforms?
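Call rates, mentioned above as a basic QA/QC check, can be computed per sample and per SNP from a table of genotype calls. The sketch below is a minimal illustration: the sample IDs, SNP labels, genotype values, and the 0.5 cut-off are hypothetical and chosen only to keep the example small; production pipelines typically apply far stricter thresholds.

```python
def call_rates(genotypes, snp_ids):
    """Compute call rates from genotype calls.

    genotypes: dict mapping sample ID -> list of calls in snp_ids order,
               with None marking a failed (missing) call.
    Returns (per_sample, per_snp) dicts of the fraction of calls made.
    """
    per_sample = {
        sample: sum(call is not None for call in calls) / len(calls)
        for sample, calls in genotypes.items()
    }
    n_samples = len(genotypes)
    per_snp = {
        snp: sum(calls[i] is not None for calls in genotypes.values()) / n_samples
        for i, snp in enumerate(snp_ids)
    }
    return per_sample, per_snp


def below_threshold(rates, minimum):
    """Names of samples or SNPs whose call rate is under the minimum."""
    return sorted(name for name, rate in rates.items() if rate < minimum)


snps = ["snp_a", "snp_b", "snp_c", "snp_d"]      # hypothetical SNP labels
calls = {                                         # hypothetical genotype calls
    "s1": ["AA", "AG", "GG", "CT"],
    "s2": ["AA", "AG", None, "CC"],
    "s3": ["AG", None, None, None],
    "s4": ["AA", "GG", None, "CT"],
}
sample_rates, snp_rates = call_rates(calls, snps)
failed_samples = below_threshold(sample_rates, 0.5)
failed_snps = below_threshold(snp_rates, 0.5)
```

Computing both views matters because a low per-sample rate points to a DNA quality problem, while a low per-SNP rate points to an assay problem; comparing per-SNP rates between cases and controls would be one way to look for the differential call rates noted above.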
- Now is the time to undertake whole genome scans. The HapMap has provided experience that can be applied to GWA studies, but issues of DNA quantity and quality have to be addressed.
- Validation needs to be possible across genotyping platforms.
- Picking the optimal SNP set is important but NHLBI should not wait for the perfect set to become available.
- Do not require every center to have expertise in DNA preparation.
- Insist that genotype facilities have rigid QC in place.
- Minimum DNA amounts are probably in the 500 nanogram to 1 microgram range with whole genome amplification, but stated DNA concentrations are often unreliable so it may be best to request 1 microgram.
- DNA quality is a consideration, but samples degraded to fragments of 2 kb can probably still be adequately genotyped.
Sample Acquisition and Types
Phenotypically and genetically, the science is moving from measuring individual molecules to assessing pathways. This means that the full complement of protein, lipid, carbohydrate, and nucleic acid data will be needed. In addition, we will need to be able to stimulate pathways and watch them respond in living cells and tissue.
It will be important to consider the frequency of sample collection. Should there be multiple collections for each "time point"? How can we minimize analytical and within-subject variance? What is the value of multiple time points for longitudinal data?
Samples will have to be protected against protease and nuclease activity and against oxidation so that metabolic pathways remain intact.
- Serum, plasma, and urine are, and will remain, useful. RNA and DNA are critical but require special preparation methods. Platelets are important for mtDNA. Living tissue increasingly is important. Although blood cells serve this need, many other biopsy samples can be obtained, such as adipose tissue, muscle, skin. Whole cells are needed for telomere analysis.
- Research is needed on biopsy methods (e.g., thin needle), rapid sample preparation, and establishing pathway-based phenotypes (e.g., effects of sample acquisition, preparation and storage, usefulness of cell lines for retained phenotypes).
- Research is needed on preparing cells so that metabolic pathways remain intact. This assumes that we know the pathways that are active in vivo, have assays for these pathways, and can perform ex vivo modifications. We also need method development for preserving metabolic activities during shipping.
- Repositories need better coordination to make it easier for researchers to access and use samples from multiple banks that are diverse and representative of the population. This requires commonality and standardization of methods, development of a comprehensive database of what is available, and improved communication among repositories.
- Because the phenotypic work required for GWA studies is complex, training facilities will be needed for technical staff.
Overall Recommendations
NHLBI should embark on large-scale genotyping and resequencing projects. The entire NHLBI community of heart, lung, blood, and sleep researchers should be invited to participate in the genomic revolution.
NHLBI should use existing cohorts to assemble the best data available for the purposes of GWA studies. These cohort studies have collected a large amount of phenotypic and exposure data, providing the opportunity to conduct many experiments, including evaluating gene candidates and positional cloning.
More than one cohort should be included, forcing resolution of data and repository harmonization and centralization issues. In selecting cohorts, consideration should be given to diversity by age, sex, race, and ethnicity, as well as availability of related individuals and opportunities for replication.
Well-defined and quantified phenotypes of public health importance that cut across NHLBI's research areas should be selected for study. Phenotypes should be selected that permit use of several population samples to obtain cases and controls.
NHLBI should encourage immediate access to data held in a centralized data repository.
Requests for proposals or applications should ask investigators to design a set of experiments that demonstrate how this approach will lead to the discovery of genes of certain size, effect, and relevance; that is, such experiments should be hypothesis-guided. Both near-term and long-term successes are needed.
A mixture of study designs should be supported, including studies of families, trios, and unrelated individuals; studies of cases and controls as well as full cohorts; all with the goal of finding associations. There may be a need for pilot studies.
It will be important to get primary information about the cohorts out to the community quickly, such as numbers of study participants with certain phenotypes, so investigators can reasonably prepare applications. Investigators should be able to query rapidly for numbers of participants with a given event.
Avoid prescribing only NHLBI-supported or known cohorts but allow wide consideration of the best available cohorts that meet the selection criteria.
Working Group Members
Eric Boerwinkle, Ph.D.
Gregory L. Burke, M.D., M.S.
Aravinda Chakravarti, Ph.D.
Christopher G. Chute, M.D., Dr.P.H.
Mark J. Daly, Ph.D.
Seth E. Dobrin, Ph.D.
Kelly A. Frazer, Ph.D.
Stacey Gabriel, Ph.D.
Gary Gibbons, M.D.
Gerardo Heiss, M.D.
Eduardo Marbán, M.D.
Betty S. Pace, M.D.
Susan Redline, M.D., M.P.H.
Christine Seidman, M.D.
Belgaum S. Thyagaraja, Ph.D.
Russell P. Tracy, Ph.D.
Bruce S. Weir, Ph.D.
Scott T. Weiss, M.D., M.S.