Digital tools for effective virus research

27 January 2022

The “Serratus” cloud-computing infrastructure enables researchers to effectively search public sequence databases for biological viruses. So far, more than 130,000 new RNA viruses have been identified – from coronaviruses to relatives of the hepatitis D virus and bacteriophages. The international team behind the project, which also includes researchers from the Heidelberg Institute for Theoretical Studies, reports the findings in the journal “Nature.”

The diversity of viruses on our planet is largely unexplored: so far, only a small fraction of existing viruses is known to science. The current SARS-CoV-2 pandemic has shown how devastating the consequences of emerging viral diseases can be for humankind. It is therefore critical to catalog the global diversity of viruses with the aid of methods from computer science and to make it usable for science.

Random finds in the rainforest

Public sequence databases have become a vast repository of genetic data, with contributions from researchers around the world. These data come from biological research groups that generate sequence data, whether to study the soil microbiome of the Amazon rainforest or the spread of pathogens such as the SARS-CoV-2 virus. Typically, such studies obtain genetic sequence data not only from the intended target organism, but also from other organisms whose genetic material happens to be included in the sample. Because these incidental data are not the focus of the original study, they are usually ignored, yet they may be of particular interest to other researchers. They are still deposited in the public databases along with the rest of the data.

An infrastructure for efficient searching

Unearthing this hidden treasure requires searching immensely large and distributed data sets, because the freely accessible public databases contain sequence data on the order of petabytes (one petabyte is one million gigabytes). For this purpose, researchers in the international Serratus project have developed an open-source cloud-computing infrastructure that can perform petabyte-scale sequence alignment.
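For a sense of scale, the following Python sketch works through the arithmetic behind the petabyte figure. The read rate of 1 gigabyte per second per worker and the worker counts are purely illustrative assumptions, not figures from the Serratus project, and the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic for petabyte-scale data (decimal units).
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15
SECONDS_PER_DAY = 86_400


def petabytes_to_gigabytes(pb: float) -> float:
    """Convert petabytes to gigabytes (1 PB = 1,000,000 GB)."""
    return pb * BYTES_PER_PB / BYTES_PER_GB


def scan_days(pb: float, gb_per_second: float, workers: int = 1) -> float:
    """Days needed to read `pb` petabytes once.

    Assumes every worker reads at the same hypothetical rate and that
    the work splits evenly across workers (an idealization).
    """
    seconds = petabytes_to_gigabytes(pb) / (gb_per_second * workers)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(petabytes_to_gigabytes(1))          # gigabytes in one petabyte
    print(scan_days(1, 1.0))                  # one worker at an assumed 1 GB/s
    print(scan_days(1, 1.0, workers=1000))    # the same work spread over 1000 workers
```

Under these assumed numbers, a single machine would need more than a week just to read one petabyte once, which is why spreading the search over many cloud workers is attractive.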

“Our infrastructure enables efficient searching of the Sequence Read Archive, one of the most popular public sequence repositories,” explains co-author Pierre Barbera from the Computational Molecular Evolution group at the Heidelberg Institute for Theoretical Studies (HITS). He developed software to calculate and analyze the phylogenetic trees of all the species studied. Researchers at the Max Planck Institute for Biology in Tübingen are also involved in the project. They contributed their biocomputing software “DIAMOND” to the project, which, like a search engine, lists matches of protein building blocks of sequenced organisms in just a few hours. Until recently, such calculations required months even with high-performance computers and the previous gold standard BLAST. The enhanced version “DIAMOND v2” is being developed in collaboration with the Max Planck Computing and Data Facility in Garching.

Also involved in the project are scientists from the Institut Pasteur (Paris, France), the University of St. Petersburg (Russia), the University of Valencia (Spain), the University of British Columbia (Canada) and UC Berkeley (USA). The corresponding author of the study is bioinformatician Artem Babaian (now at the University of Cambridge, UK).

Number of newly discovered viruses increased tenfold

Using the tools developed, the researchers were able to identify more than 130,000 new RNA viruses, a tenfold increase in the number of known virus species. These included previously unknown members of the coronavirus family related to the SARS-CoV-2 virus, novel viruses related to the hepatitis D virus, and novel bacteriophages, viruses that specifically target bacteria.

The results of their study have now been published in the journal Nature. The data from the project are open source and can also be found on the website www.serratus.io, so that researchers can access and study them further.

Edgar, R.C., Taylor, J., Lin, V. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature, 26 January 2022.
DOI: 10.1038/s41586-021-04332-2 / https://www.nature.com/articles/s41586-021-04332-2

Scientific contact:

Dr. Pierre Barbera
Heidelberg Institute for Theoretical Studies (HITS)
pierre.barbera@h-its.org

Media contact:
Dr. Peter Saueressig
Head of Communications
Heidelberg Institute for Theoretical Studies (HITS)
Phone: +49-6221-533-245
peter.saueressig@h-its.org

About HITS

HITS, the Heidelberg Institute for Theoretical Studies, was established in 2010 by physicist and SAP co-founder Klaus Tschira (1940-2015) and the Klaus Tschira Foundation as a private, non-profit research institute. HITS conducts basic research in the natural, mathematical, and computer sciences. Major research directions include complex simulations across scales, making sense of data, and enabling science via computational research. Application areas range from molecular biology to astrophysics. An essential characteristic of the Institute is interdisciplinarity, implemented in numerous cross-group and cross-disciplinary projects. The base funding of HITS is provided by the Klaus Tschira Foundation.
