New phylogenetic tool can handle SARS-CoV-2 data load
Researchers at UC San Diego, in collaboration with UC Santa Cruz, have developed a new software tool to track and map the evolution of the SARS-CoV-2 virus, capable of handling the unprecedented amount of genetic data generated by a rapidly spreading, evolving pathogen. The software is used to efficiently and accurately track new variants of the virus on what is called a phylogenetic tree: a visual history or map of an organism’s genetic changes and variations over time and geography. Using this new optimization tool, called matOptimize, researchers can now more accurately track the SARS-CoV-2 viral genome, map new variants on the phylogenetic tree as they emerge, and follow the evolutionary and transmission dynamics of the virus.
The tool was described in the journal Bioinformatics, with Cheng Ye, an undergraduate computer engineering student at UC San Diego, as lead author. Learn more about Ye’s journey to research as an undergraduate student, and his experience working on such a timely project, in this Q&A.
“With over 10 million SARS-CoV-2 genome sequences now available, maintaining an accurate and complete phylogenetic tree of all available SARS-CoV-2 sequences becomes impossible to calculate with existing software, but is essential to get a detailed picture of the virus’s evolution and transmission,” the researchers, led by UC San Diego electrical and computer engineering professor Yatish Turakhia, write in the paper.
Currently, the program used for the phylogeny of SARS-CoV-2 is called UShER: Ultrafast Sample placement on Existing tRee. UShER was developed by Turakhia as a postdoctoral researcher at UC Santa Cruz, and is used by UC Santa Cruz to maintain the phylogeny of SARS-CoV-2. The tree is publicly viewable at https://taxonium.org/?backend=https://api.cov2tree.org.
A few months into the pandemic, UShER ran into a problem as new genetic sequences were added to the tree. The team was adding sequences incrementally, one at a time, but when an input sequence was incorrect or ambiguous, the system lost accuracy.
“UShER would make a guess: an educated guess, but still a guess,” Turakhia said.
As a result, these sequences would sometimes be placed suboptimally on the tree, producing false mutations. To refine these placements, a tree optimization method was needed. However, existing tree optimizers could not keep up with the volume of SARS-CoV-2 genetic data being generated, with 10 million sequences currently mapped and up to 100,000 sequences added daily.
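To make the idea of a "false mutation" concrete, here is a minimal sketch of how the number of mutations implied by a tree can be counted at a single genome position, using the classic Fitch small-parsimony method. This is only an illustration and not matOptimize's code; the function name, the nested-tuple tree format and the sample names s1 to s4 are all hypothetical.

```python
def fitch_score(tree, leaf_states):
    """Count the minimum mutations one genome site needs on a rooted binary tree.

    tree: nested 2-tuples of leaf names (hypothetical format for this sketch).
    leaf_states: dict mapping each leaf name to its nucleotide at the site.
    """
    def visit(node):
        # A leaf contributes its observed nucleotide and no mutations.
        if isinstance(node, str):
            return {leaf_states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children agree on at least one state: no extra mutation.
            return common, left_cost + right_cost
        # The children disagree: one mutation is needed on a branch below.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Four hypothetical samples; two carry A and two carry G at this site.
states = {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}

well_placed = (("s1", "s2"), ("s3", "s4"))    # similar samples grouped
poorly_placed = (("s1", "s3"), ("s2", "s4"))  # similar samples split apart

print(fitch_score(well_placed, states))    # 1
print(fitch_score(poorly_placed, states))  # 2
```

The same four samples imply one mutation on the first tree and two on the second. The extra mutation is an artifact of the poor arrangement, which is the kind of error a tree optimizer tries to remove by rearranging branches.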
It was then that Turakhia worked with Ye and other students in his lab on the challenge of creating a better tree optimizer. Ye had joined Turakhia’s lab under the Electrical and Computer Engineering Summer Research Internship Program (SRIP) in January 2021. When it became clear to Turakhia that Ye’s fundamentals in data structures, parallel algorithms, programming and bioinformatics were quite solid, he entrusted him with a leading role in this task.
“I was initially assigned to work on speeding up sequence alignment on graphics processing units, but I thought the SARS-CoV-2 phylogeny project might be more exciting, and indeed that was the case,” Ye said.
“In that time, [Cheng] became an expert in tree optimization,” said Turakhia.
Many existing tree optimizers were closed source, so Ye had to work with what was available in the literature to devise a solution to the data challenge. After a few months of research, Ye developed matOptimize, currently the only tool capable of keeping pace with the rapidly growing volume of SARS-CoV-2 genetic data.
To achieve this, Ye created truly parallel software, with processing distributed across multiple processors and a significantly lower memory requirement. This allows it to scale to the level of data required for the SARS-CoV-2 phylogeny.
Today, UShER, as phylogenetic tree software, and matOptimize, as a tree optimization method, are used together to characterize the phylogeny of SARS-CoV-2. There is now a comprehensive catalog of genetic sequences, and scientists at UC San Diego and UC Santa Cruz continue to track those that phylogenetic inference highlights as more dangerous or transmissible.
Going forward, Turakhia’s team is using this information to study SARS-CoV-2 recombination, a phenomenon that could lead to dangerous new variants.
“In collaboration with Professor Russell Corbett-Detig’s group at UC Santa Cruz, Cheng and I have developed software called RIPPLES, which can sensitively detect recombinants in data sets 1,000 times larger,” Turakhia said. “This software will help monitor the emergence of new recombinant SARS-CoV-2 and will likely be applied to other pathogens in the future.”