Human Genomics Research Has A Diversity Problem

Precision medicine promises to tailor the diagnosis and treatment of disease to your unique genetic makeup. A doctor may use the presence of certain genetic markers to diagnose a disease, or choose one drug for treatment over another.

But the studies that link genetic markers with disease focus largely on white European populations and neglect other races and ethnicities, according to an analysis published in the journal Cell on Thursday. The researchers argue this lack of diversity in genomic studies harms our scientific understanding of the genetic underpinnings of disease in all populations and exacerbates health care inequities.

The analysis reports that 78 percent of all individuals included in genomic studies of disease up to 2018 were of European descent, 10 percent Asian, 2 percent African, 1 percent Hispanic, and less than 1 percent for all other groups.

"That is just unbelievable," says Sarah Tishkoff, an evolutionary geneticist at the Perelman School of Medicine at the University of Pennsylvania who was an author of this analysis. "It really limits our understanding."

Ignoring genomic diversity can mean missing out on information that could benefit all. For example, the authors of the study point to PCSK9, a gene important for regulating cholesterol. Studying mutations that occurred in West African populations provided extra insight into the underlying biology and led to a new class of drugs that benefit people of all races.

"We're really just choosing to miss out on learning all sorts of things about the genome and what it does," says Alice Popejoy, a postdoctoral geneticist at Stanford University who was not involved in this analysis.

The genetics of disease range from relatively simple to mystifyingly complex. At one extreme are Mendelian diseases, where one gene variant essentially guarantees that you'll have that disease, regardless of your genetic background. Think Huntington's disease or muscular dystrophy.

At the other extreme are diseases where many different genes seem to contribute, alongside environmental factors. Think hypertension or coronary artery disease. The lack of diversity in data sets can be particularly problematic for researchers studying polygenic disease.

Polygenic diseases vastly outnumber Mendelian diseases, making them a top research priority. But for a researcher, trying to identify the genes involved in a polygenic disease is like looking for an unknown number of needles in an enormous haystack.

Imagine our genome as a long line of about 3 billion base pairs, the letters that make up our genetic code. A researcher can use genetic markers, present in most people, to orient herself. These markers pop up at somewhat regular intervals across the whole line of letters.

Our researcher can then conduct what's known as a genome-wide association study or GWAS, where she sequences these genetic markers in thousands of people, some portion of whom have a given disease. To home in on disease-causing genes, she looks for markers that keep popping up in people with the disease. If a marker is strongly associated with presence of the disease, the researcher infers that a disease gene must be nearby.

This conclusion is possible because letters that are close together tend to be linked and inherited as a block that is passed down the generations. The blocks can vary in size, but in general if a marker is associated with a disease, geneticists assume the disease-causing gene is in the same block.
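
To make the association step concrete, here is a minimal sketch in Python of how a marker might be scored. It assumes a simple chi-square test on a two-by-two table of counts, which is only one of several statistics a real GWAS could use. The marker names and all of the counts are invented for illustration, and `chi_square_2x2` is a hypothetical helper, not part of any library or of the Cell analysis.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

# Invented counts per marker:
# (cases with marker, cases without, controls with, controls without)
markers = {
    "marker_A": (300, 700, 200, 800),  # more common in people with the disease
    "marker_B": (250, 750, 248, 752),  # about equally common in both groups
}

# A larger statistic means a stronger association between marker and disease.
scores = {name: chi_square_2x2(*counts) for name, counts in markers.items()}
top = max(scores, key=scores.get)
print(scores)
print("strongest association:", top)
```

In this toy data set the marker that shows up more often in people with the disease gets the larger score, which is the signal a researcher would follow to the nearby stretch of the genome.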

But the authors of this analysis argue that inference can be faulty when comparing markers across different ethnic populations for two reasons. One is that the genes themselves may have changed, either through selection or random chance, in different populations.

For example, Tishkoff cites a gene that's strongly associated with non-diabetic kidney disease. This condition is rare among Europeans, but more common among West Africans. Researchers pinpointed two mutations in a gene that seems to be associated with this disease, and further research suggested that these mutations appear at higher frequency in West African populations because they confer some protection against sleeping sickness. Tishkoff says that if we'd only considered European variation, we'd have missed this example of how disease-causing genes can also be beneficial in some environments.

Aside from the genes themselves changing, the genetic markers that act as signposts can get mixed up and rearranged in different populations, according to the authors. In fact, basic evolutionary theory predicts it.

Homo sapiens emerged in Africa approximately 300,000 to 200,000 years ago, leaving the continent much later, in small bursts of migration. Our genomes reflect this history, with Africans harboring much more genetic diversity than any other human population.

Populations with more diversity tend to have smaller blocks of the genome that are linked together, according to Tishkoff. But that blocking pattern can change during a migration event.

Imagine the gene pool of Africa as an actual swimming pool, filled with marbles of every color. "You reach in and grab a handful of marbles, and you're getting a very small subset of that variation," Tishkoff says. Every time a small band of humans left Africa, they carried only a small fraction of that diversity with them, and the populations that stem from those migration events tend to have bigger chunks of the genome linked together.
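
The marble analogy can be played out in a few lines of Python. This is a toy sketch only: the pool size, the number of colors and the size of the handful are all invented, and real populations are far more complicated than a bag of marbles.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# The "pool": 100 marble colors, 100 marbles of each (invented numbers).
pool = [color for color in range(100) for _ in range(100)]

# A small founding band grabs a handful of 20 marbles.
handful = rng.sample(pool, 20)

pool_colors = len(set(pool))        # diversity in the source population
handful_colors = len(set(handful))  # diversity carried away by the founders
print(pool_colors, handful_colors)
```

However the draw turns out, a handful of 20 marbles can hold at most 20 of the 100 colors, so the founding group starts with only a slice of the original variation.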

These different patterns of linkage can spell trouble for comparing across populations, as the markers associated with a disease-causing gene in European populations might exist in a totally different part of the genome in African or Hispanic populations, according to Tishkoff. A marker that accurately tagged a gene that increased risk of heart disease in Europeans might be miles away, genomically speaking, from that same gene in other populations, rendering the marker meaningless.

Tishkoff stresses that ignoring genomic diversity means that right now, genetically informed health care is worse, in some cases, for populations of non-European descent. Polygenic risk scores for diseases, which are calibrated using GWAS results and can be used to inform treatment, can be less accurate when applied to populations other than the ones they were calibrated on, leading to false positives or to underestimates of the risk of certain diseases.
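
For readers curious about what a polygenic risk score looks like under the hood, one common simplified form is a weighted sum: each marker gets a weight, and a person's score adds up the weights for the risk variants they carry. The sketch below assumes that form. The marker names, weights and genotype are invented, and `polygenic_score` is a hypothetical helper rather than any published method.

```python
# Invented per-marker weights, standing in for values estimated from a GWAS.
weights = {"marker_A": 0.30, "marker_B": 0.10, "marker_C": -0.05}

def polygenic_score(genotype, weights):
    """Sum of weight * number of risk-variant copies (0, 1 or 2) per marker."""
    return sum(w * genotype.get(marker, 0) for marker, w in weights.items())

# One hypothetical person: two copies at marker_A, one at marker_B, none at marker_C.
person = {"marker_A": 2, "marker_B": 1, "marker_C": 0}
score = polygenic_score(person, weights)
print(round(score, 2))
```

Because the weights come from the population that was studied, a marker that tracks a disease gene well in that group may carry the wrong weight in a population where the marker and the gene are no longer linked.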

"There are lots of reasons for health disparities, obviously the biggest player is probably just unequal access to health care," says Tishkoff. "But if we want all people to have max benefit from human genomics research, we need to be including them in the studies."

Popejoy agrees, though she emphasizes that the genetics of health disparities is only a small part of the problem. People "shouldn't get the impression that health disparities are driven by differences in genetic structure between ethnic groups," she says. "Environment matters and widespread systemic and structural racism that exacerbates environmental effects are more important."

Still, both Popejoy and Tishkoff say much more could be done to increase diversity in genomic studies. "We need changes both from the top-down and the bottom-up," Popejoy says.

"Funding agencies need to financially encourage studying ethnically diverse populations," Tishkoff says. "We're already seeing the needle shifting, with initiatives like NIH's All of Us." That research initiative seeks to collect genomic data from diverse populations while making an effort to provide participants with their results.

Given the history of unethical medical research on minority communities, Popejoy says that researchers need to meaningfully engage with people affected by a research agenda. "Researchers need to recognize the value, both scientifically and ethically, in studying diverse populations, but they also need to demonstrate that value to the people they're studying," Popejoy says.

Jonathan Lambert is an intern on NPR's Science Desk. You can follow him on Twitter: @evolambert