Research Roundup: Genomic Data Release Opens New Paths for Discovery

March 17, 2022
Infographic: “The Researcher Workbench.” Within the Controlled Tier, there are more than 98,600 whole genome sequences; more than 165,000 genotyping arrays; more than 593 million unique variants; and genomics analysis tools.

The All of Us Research Program reached an important milestone this week with the release of its initial genomic dataset. Nearly 100,000 whole genome sequences (WGS) and 165,000 genotyping arrays are now available within the Researcher Workbench, less than two years after the beta launch of the platform. Nearly 50% of the data come from participants who self-identify with a racial or ethnic minority group.

This data release not only makes the All of Us dataset one of the most diverse genomic datasets of its kind but also adds to the richness of the data already available in the Researcher Workbench. The genomic data are integrated alongside information from surveys, physical measurements, electronic health records (EHRs), and wearable devices, creating a robust resource to inform thousands of studies across different health conditions.

“For the first time, we will have an opportunity to really understand the genetic architecture of health and disease across the diverse makeup of our country’s population,” says All of Us Chief Medical and Scientific Officer Geoff Ginsburg, M.D., Ph.D.

With the unsurpassed dimensionality of data, Dr. Ginsburg hopes that scientists and trainees will be inspired to design innovative research studies that pave the way for individualized prevention and treatment of disease.

Diverse Range of Participant Data and Users

The lack of diversity in large genomic studies to date has had huge impacts, limiting discovery while exacerbating health disparities. All of Us aims to help change that.

“What makes this initial collection of data so special is who it comes from,” says Kelsey Mayo, Ph.D., scientific portfolio and product manager at the Vanderbilt University Medical Center Data and Research Center. “What's going to grab researchers' attention is the diversity of the cohort. Half of our cohort is non-European. More than 90% of participants in genome-wide association studies have been of European descent. There's just a real absence of genetic data from African, Asian, and Latino people. All of Us participants are providing this important data that’s been missing in health research. So we are going to have that new genetic information that's been missing.” 

The All of Us cohort also includes participants who self-identify as American Indian (AI) or Alaska Native (AN), but their data have not yet been added to the Researcher Workbench. AI/AN data are currently undergoing additional review to ensure that those data have been cleared of all identifying information, in keeping with commitments the program made during its nationwide Tribal consultation process. The data will be available in a forthcoming release alongside resources to provide important context for researchers.

Researchers interested in exploring the diversity in the workbench’s latest release can use the public Data Browser to search within the dataset’s 593 million genetic variants. Using this tool, anyone can search for specific genes or variants and see aggregate counts of their frequency in the All of Us dataset, as well as the genetic ancestry of participants with each variant.

Alongside participant diversity, All of Us aims to make its centralized, secure, cloud-based platform available to researchers across a wide range of settings and institutions and at all stages of their careers (e.g., students and early-stage investigators). This accessibility helps researchers execute rapid, hypothesis-driven research. All of Us is also working deliberately to ensure equitable access by building a demographically diverse researcher cohort. Read more about how All of Us is working to ensure a diverse research workforce.

Controlled Tier Data and Access

Detailed genomic data are accessible through the Controlled Tier, a level of the Researcher Workbench with stricter requirements for access. This level also includes all of the data currently available in the Registered Tier, as well as additional clinical fields in EHRs, and more granular demographic data from both surveys and EHRs. For example, in the Controlled Tier, researchers can find additional International Classification of Diseases (ICD) codes for COVID-19 testing and diagnosis data, real dates of health events, and residential information (first three digits of the ZIP code).
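To make the residential detail concrete, the snippet below is a minimal sketch of reducing a ZIP code to its first three digits. The function name `zip3` and the handling of ZIP+4 input are assumptions made for illustration; the article does not describe how All of Us performs this step.

```python
def zip3(zip_code: str) -> str:
    """Return the first three digits of a U.S. ZIP code.

    Hypothetical helper for illustration only; not the program's pipeline.
    Accepts five-digit ZIP codes and ZIP+4 codes such as "02139-4307".
    """
    # Keep digits only, so the hyphen in a ZIP+4 code is ignored.
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        raise ValueError(f"not a valid ZIP code: {zip_code!r}")
    return digits[:3]


print(zip3("37203"))
print(zip3("02139-4307"))
```

Keeping the value as a string rather than an integer preserves leading zeros, which matter for ZIP codes in the northeastern United States.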

As part of this data and tools refresh, the Researcher Workbench now also has data from the U.S. Census Bureau’s American Community Survey (ACS), which tracks communities’ shifting demographics, including educational attainment, income, language proficiency, disability, employment, and housing characteristics. All of Us anticipates that this will be the first of many linked datasets that will help bring richer insights and utility to the All of Us platform.

The Controlled Tier is available to registered researchers who have completed additional training and whose institutions have an additional data use agreement in place with All of Us. These requirements help All of Us provide broad access to the data while protecting the privacy and security of data shared by the program’s more than 320,000 participants. A list of organizations with institutional agreements in place is available at ResearchAllofUs.org. The registration process is streamlined for individual researchers from organizations that have a completed agreement.

Demonstration Projects

To support the launch of the Controlled Tier, half a dozen research teams helped check data validity and test the platform’s new tools. Bioinformaticist Eric Venner, Ph.D., from the Baylor College of Medicine's Human Genome Sequencing Center, leads one team. He and his colleagues have been looking at genetic variants that affect how people may react to a certain medication, a field known as pharmacogenomics. The team is pairing genetic information with medical information such as EHRs and survey responses to find genetic variants that might indicate a sensitive reaction to a particular drug.

“Then we can ask questions like, 'How common are those variants in our cohort? And are there other high-value things that we could add to the health results we return to participants, things that would really benefit people?'” Dr. Venner says. “Maybe we can look at drug sensitivities in people with different ancestries or people with different disease backgrounds.” 

Feedback from the demonstration teams led to improvements in the workbench’s tools, workflows, and support materials. Feedback also provided key insights into the costs of executing large-scale computational analysis. Currently, All of Us provides $300 in initial credits for compute time for each registered Researcher Workbench user. Because the size of the genomic data means initial credits can be used up quickly, cost estimates for each computation are visible within each researcher’s workspace to aid in planning. The All of Us Data and Research Center support team is available to provide suggestions for reducing computational costs for large analyses and can be reached at support@researchallofus.org.
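As a rough planning aid, the sketch below checks a list of per-analysis cost estimates against the $300 in initial credits. Only the $300 figure comes from the article. The function names and the dollar amounts in `planned` are hypothetical, and the code does not use any Researcher Workbench API.

```python
INITIAL_CREDITS_USD = 300.00  # initial compute credits stated in the article


def remaining_credits(estimated_costs_usd, credits_usd=INITIAL_CREDITS_USD):
    """Return credits left after all planned analyses (negative if over budget)."""
    return round(credits_usd - sum(estimated_costs_usd), 2)


def analyses_within_budget(estimated_costs_usd, credits_usd=INITIAL_CREDITS_USD):
    """Count how many analyses, run in the given order, fit within the credits."""
    spent = 0.0
    count = 0
    for cost in estimated_costs_usd:
        if spent + cost > credits_usd:
            break  # the next analysis would exceed the available credits
        spent += cost
        count += 1
    return count


# Hypothetical cost estimates, in U.S. dollars, for four planned analyses.
planned = [42.50, 120.00, 95.25, 80.00]

print(remaining_credits(planned))       # -37.75: the plan exceeds the credits
print(analyses_within_budget(planned))  # 3: the first three fit (257.75 total)
```

A negative result suggests contacting the support team about reducing computational costs before running the full plan.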

New Data Bring New Opportunities

“This is really just the beginning of it all,” says Dr. Ginsburg. “At the end of the day, we expect to have a million genomes. And this is just a first step in that direction.”

More than 1,500 researchers have already registered to use the Researcher Workbench, and more than 1,150 projects are underway. To learn more and register for access, visit https://www.researchallofus.org/register/.


This article appeared in the March 2022 issue of All of Us Research Roundup, the bimonthly researcher newsletter.
