March 4th, 2015
Searching for ancestors online is a popular activity. So is having your DNA sequenced. But merging the two has created a problem: it’s very easy to use genealogy software and DNA sequence data to identify people who are supposed to be anonymous.
This means all sorts of data, like health history, race and ethnicity, and even paternity, aren’t really as confidential as they’re supposed to be. It also throws a wrench in the efforts of scientists to provide genetic data to any researcher, for free and without restriction.
Yaniv Erlich, a geneticist at the Whitehead Institute in Massachusetts, started looking at new tools to probe big DNA databases. He found that by using just a string of DNA code and a participant’s age (both available in the database), he could identify people by matching that information against public genealogy sites. His work was published in the 18 January 2013 issue of Science.
Erlich started his efforts by looking at the 1000 Genomes Project, an international study that collects DNA sequences, ages, and regions where participants reside, and posts that information on the internet. He found that tiny, inherited patterns in DNA on the Y chromosome called “short tandem repeats” could help identify the last names of men in the study. In fact, genealogy websites use these short tandem repeats to help men find other men with the same last name, who are likely relatives on their father’s side.
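To make the surname-matching idea concrete, here is a minimal sketch of how a Y chromosome profile might be compared against genealogy records. It is not Erlich's actual method: the records, the repeat counts, the `max_mismatches` threshold and the function names are all invented for illustration. Only the marker names (DYS393, DYS390, DYS19) are real short tandem repeat markers.

```python
from collections import Counter

# Toy genealogy records: (surname, {marker: repeat count}).
# All surnames and repeat counts here are invented for illustration.
RECORDS = [
    ("Smith", {"DYS393": 13, "DYS390": 24, "DYS19": 14}),
    ("Smith", {"DYS393": 13, "DYS390": 24, "DYS19": 15}),
    ("Jones", {"DYS393": 12, "DYS390": 22, "DYS19": 14}),
]


def mismatches(a, b):
    """Count markers present in both profiles whose repeat counts differ."""
    shared = a.keys() & b.keys()
    if not shared:
        # Nothing to compare, so treat the profiles as unrelated.
        return float("inf")
    return sum(1 for marker in shared if a[marker] != b[marker])


def infer_surname(query, records, max_mismatches=1):
    """Return the most common surname among close matches, or None."""
    hits = Counter(
        surname
        for surname, profile in records
        if mismatches(query, profile) <= max_mismatches
    )
    return hits.most_common(1)[0][0] if hits else None


if __name__ == "__main__":
    anonymous_profile = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
    print(infer_surname(anonymous_profile, RECORDS))
```

Allowing a small number of mismatches is a design choice in this sketch: repeat counts can change slightly between generations, so distant relatives sharing a surname may not match exactly.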
By testing the genome of someone well known (DNA sequencing pioneer Craig Venter, whose DNA is published), Erlich found that he could identify Venter solely by searching the DNA databases. He estimated that, using DNA sequences alone, he could identify about 12 percent of the men in the project.
He didn’t stop there. Erlich obtained the Y chromosome short tandem repeats from one male participant in the database. By focusing on men from Utah and knowing his participant’s age, he had the region, age and DNA sequence. Genealogy databases quickly gave him the man’s grandfathers on both sides of his family. A quick Google search of the names turned up an obituary, which gave him the names of more relatives, people who were not part of the 1000 Genomes Project.
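The narrowing step, going from a likely surname to a single person using region and age, can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Person` record, the `narrow_candidates` function, the one-year tolerance and every entry in the list are hypothetical and do not come from the study or from any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """A hypothetical public-record entry (all fields invented)."""
    name: str
    surname: str
    state: str
    birth_year: int


def narrow_candidates(people, surname, state, age, as_of_year, tolerance=1):
    """Keep people whose surname, state and approximate birth year all match."""
    target_year = as_of_year - age
    return [
        p
        for p in people
        if p.surname == surname
        and p.state == state
        and abs(p.birth_year - target_year) <= tolerance
    ]


# Invented records, for illustration only.
PEOPLE = [
    Person("A. Smith", "Smith", "Utah", 1950),
    Person("B. Smith", "Smith", "Utah", 1975),
    Person("C. Smith", "Smith", "Ohio", 1950),
    Person("D. Jones", "Jones", "Utah", 1950),
]

if __name__ == "__main__":
    for person in narrow_candidates(PEOPLE, "Smith", "Utah", age=63, as_of_year=2013):
        print(person.name)
```

Each attribute on its own matches several people; it is the combination of the three that shrinks the candidate list, which is why releasing age and region alongside DNA sequences matters.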
Erlich’s study has created a storm of sorts in research circles, as well as within government agencies tasked with managing DNA and other supposedly confidential health data. Recently, in fact, the White House issued recommendations on maintaining the privacy of DNA data while allowing its unrestricted use among researchers. Those two recommendations appear to be on a collision course.
Gymrek, M., McGuire, A., Golan, D., Halperin, E., & Erlich, Y. (2013). Identifying Personal Genomes by Surname Inference. Science, 339(6117), 321-324. DOI: 10.1126/science.1229566
Photo: MIKI Yoshihito, Flickr