Only a tenth of the human genome is studied

One tale of Nasreddin, a self-satirising 13th-century philosopher, tells of the time he lost a precious ring. When his wife asks why he is searching in the yard rather than inside, where the ring was lost, Nasreddin explains that the light is better outside. Looking for something where the search is easiest is a form of bias now known as the “street light” effect. A study published this week in PLOS Biology reports a similar skew in modern genetics that may be leaving thousands of important genes largely unstudied.

There are roughly 20,000 genes in the human genome. Understanding genes and the proteins they encode can help to unravel the causes of diseases, and inspire new drugs to treat them. But most research focuses on only about 10% of genes. Thomas Stoeger, Luis Amaral and their colleagues at Northwestern University in Illinois used machine learning to investigate why that might be.

First the team assembled a database of 430 biochemical features of both the genes themselves (such as the levels at which they are expressed in different cells) and the proteins for which they code (for example, their solubility). When they fed these data to their algorithm, they were able to explain about 40% of the variation in the attention paid to genes (measured by the number of papers published on each) using just 15 features. Essentially, there were more papers on abundantly expressed genes that encode stable proteins. That suggests researchers, perhaps not unreasonably, focus on genes that are easier to study. Oddly, though, the pattern of publication has not changed much since 2000, despite the completion of the Human Genome Project in 2003 and huge advances in DNA-sequencing technology.

One possible reason for that can be found in another phenomenon known as the “Matthew effect”. Pithily summarised by the adage “the rich get richer”, this predicts that researchers and money will flow to subjects that are already well-established. To see if this was the case, the team added the year of each gene’s discovery to their model and found its explanatory power jumped to 56%, because earlier discoveries translated into greater attention. The identification of a new human gene is often preceded by the discovery of similar genes in scientific workhorses such as fruit flies, rats and mice. When the researchers added the number of papers relating to these animal genes, the algorithm’s predictive powers improved even further, to 76%.
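
For readers curious how adding information can raise a model's explanatory power, here is a minimal sketch. It is not the team's method: it uses ordinary least-squares regression as a stand-in for their algorithm, and the data are synthetic. The names `biochemistry`, `discovery_year`, `animal_papers` and `attention` are hypothetical placeholders, each an invented column of random numbers rather than a real measurement. The point is only that the share of variation explained rises as each group of features is added.

```python
import random


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """Fit y on the given columns plus an intercept; return the share of variation explained."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Synthetic, made-up data: three independent influences plus noise.
rng = random.Random(0)
n_genes = 500
biochemistry = [rng.gauss(0, 1) for _ in range(n_genes)]
discovery_year = [rng.gauss(0, 1) for _ in range(n_genes)]
animal_papers = [rng.gauss(0, 1) for _ in range(n_genes)]
attention = [
    b + d + a + rng.gauss(0, 1)
    for b, d, a in zip(biochemistry, discovery_year, animal_papers)
]

# Add one group of features at a time and record the share explained.
feature_groups = [biochemistry, discovery_year, animal_papers]
r2_steps = [r_squared(feature_groups[:k], attention) for k in (1, 2, 3)]
for k, r2 in enumerate(r2_steps, start=1):
    print(f"{k} feature group(s): {r2:.0%} of variation explained")
```

The invented numbers are built so that each group matters equally, which is why the share explained climbs in steps. In the study itself the steps are those reported above: about 40%, then 56%, then 76%.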

All this might be justified if the most-studied genes were also the most important: if, for instance, mutations within them were associated with serious or common diseases. The team found that the most-researched 10% of genes were indeed between three and five times more likely to be involved in disease. But they receive disproportionate attention, accruing thousands of times as many publications as the least-researched 10%.

The team found these biases were reproduced in funding decisions made by America’s National Institutes of Health, the world’s biggest sponsor of biomedical research; they also found a similar pattern in drug development in the private sector. Drugs are often made to tweak the behaviour of the proteins that particular genes encode. Although there are presently drugs in development for 30% of disease-associated genes discovered before 1981, the same is true for only 2% of genes discovered since 2001.

No doubt much remains to be learned about even the best-studied genes. But the upshot of all this is that a wealth of discoveries and treatments is likely to await scientists, and funding agencies, bold enough to look elsewhere. Time to shine a light on the darker parts of the genome.