Genome India sequences will help build a reference genome panel for the country's population. Credit: Andrew Brookes/Connect Images/Getty Images
In a high-security data centre in northern India, genetic blueprints from 10,000 individuals across 83 ethnic groups are stored under layers of encryption. This vast dataset, collected through the country’s largest genomic sequencing exercise, could reshape the future of medicine and science for its citizens.
The information is anonymized and double-blinded, ensuring that even researchers analyzing the data cannot trace sequences back to individuals. As India builds its own genetic reference library, the Genome India project is laying the foundation for precision medicine tailored to its diverse population.
Yet, with AI revolutionizing genomic research and sequencing technologies expanding, experts warn that consent models and privacy safeguards must evolve to keep pace.
“Even the Genome India team doesn’t know the origin of the samples they receive for sequencing or analysis,” says Suchita Ninawe, an advisor at India’s Department of Biotechnology.
The dataset, archived at the Indian Biological Data Center in Faridabad, spans India’s four major linguistic families — Indo-European, Dravidian, Austro-Asiatic, and Tibeto-Burman — offering a crucial baseline for disease risk prediction, pharmacogenomics, and precision medicine.
India is home to more than 4,600 distinct population groups, many of them endogamous. Researchers aimed to maximize the genetic diversity captured in the sample while ensuring strict privacy protections under the Biotech PRIDE guidelines.
The project’s first phase uncovered more than 135 million genetic variations, including seven million novel variants absent from global genomic databases. Many of these mutations have direct clinical significance, potentially influencing disease predispositions and drug responses.
The sequences will also contribute to a reference genome panel for India, helping researchers fill in missing variants when analyzing genomes sequenced at low coverage, explains Kumarasamy Thangaraj, a population geneticist and joint national coordinator for Genome India.
Strict access controls and industry limits
Currently, access to the dataset is limited to academic researchers, who must apply through a managed access system. Sensitive details, including individual identities, caste, and tribal affiliations, remain strictly off-limits.
According to Thangaraj, only allele frequency data, which indicates how common a genetic variant is, may be made publicly available. Most genomic sequences will require rigorous approvals, governed by the Framework for Exchange of Data (FeED) Protocols.
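As a rough illustration of what allele frequency data summarizes, the sketch below counts how often each allele appears at a single site across a handful of people. The function name and the sample genotypes are hypothetical and are not drawn from the Genome India dataset or its tools.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the share of each allele across diploid genotypes at one site.

    genotypes: iterable of (allele, allele) pairs, one pair per person.
    """
    # Every person contributes two alleles to the pooled count.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes for five people at one variant site.
sample = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A"), ("A", "G")]
freqs = allele_frequencies(sample)
print(freqs)
```

Here six of the ten alleles are A and four are G, so the frequencies come out at 0.6 and 0.4. The result is a population-level summary rather than a list of each person's sequence.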
Private companies may collaborate if their research aligns with national interests, but direct access to human genome data remains restricted. Future policy changes may allow some level of access, Ninawe adds.
The next phase of Genome India will expand sequencing efforts to include disease-specific studies, focusing on rare disorders, cancer, lifestyle diseases, and neurological conditions.
AI and the future of genomic privacy
Sequencing the genomes of India’s entire population will remain technologically and logistically impractical for the near future, but advances in AI-driven data analysis could soon accelerate discoveries from the data already collected. As AI improves, however, concerns about data privacy and consent models are intensifying.