Myriad Genetics Blog

Software for flexible analysis and interpretation of genetic mutations

September 25, 2014

Advances in DNA sequencing technology are changing healthcare in many ways, including allowing us to determine a patient’s DNA sequence cheaper and faster than ever before. With modern technology, we can quickly discover if a patient has an unexpected change in the sequence of one of their genes. Depending on the field, you may see these changes referred to as “mutations”, “variants”, or “alleles”. However, knowing which particular allele is present in a patient is not enough: we must also interpret it to determine whether that particular sequence will have a medical impact on a patient. This “interpretation problem” is an increasing challenge for many labs.

At Counsyl, we have a growing variant curation team that researches every single new mutation we discover and determines its impact on a patient’s risk for disease. It turns out a lot of data is needed to do this, and I developed a tool to make this process better.

What does a mutation mean for a patient?

A mutation in the BRCA1 or BRCA2 genes can increase one’s risk of cancer. For example, in the general population, women have a 12% chance of developing breast cancer but those with a BRCA1 mutation may have as high as an 80% chance.

Suppose a patient has had both of their BRCA genes sequenced. A small part of the sequence might read like:

Patient's Gene:     ...ACGTGC...
Known Healthy Gene: ...ACGCGC...

If you look carefully, you’ll notice that the two sequences don’t quite match up. The patient has a T at a location where the known healthy gene has a C.

Is the patient at a higher risk for cancer? Not necessarily. Answering this question is the job of the Counsyl curation team. A geneticist on the curation team might start by looking for existing consensus on the effects of the allele.
Databases such as ClinVar and HGMD contain published peer-reviewed studies on the effects of such alleles. If the allele has been studied thoroughly, the geneticist can confidently classify the allele as deleterious (harmful) or benign (not harmful).

If the allele has not been studied thoroughly, however, answering the question becomes more difficult. Before my project, geneticists would make this determination using qualitative information such as

- medical histories of other patients known to have the allele
- how often the allele occurs compared with the disease it might cause
- how the allele affects the cell’s ability to create the protein for which it codes
- how the allele affects the structure of the protein for which it codes

Software to the Rescue

The geneticists at Counsyl aren’t alone on the curation team. Software engineers work hand-in-hand with them to make the curation process more reliable and efficient, in part by providing tools that allow for quantitative analysis of alleles. A computational score gives a formal, quantitative representation of some attribute of an allele that might otherwise be evaluated qualitatively. Examples include predicting whether a mutation disrupts RNA splicing, or detecting whether a mutation changes a site that has stayed the same over long evolutionary time periods.

The benefits of emphasizing quantitative analysis are twofold:

- The curation process becomes more efficient. Geneticists on the curation team can make decisions more quickly because they do not have to manually process raw data.
- The curation process becomes more reliable. The scores create a frame of reference on which consistent standards and processes can be built.

Classifying alleles via quantitative analysis is still a new idea in industry.
Many of the software packages used to calculate quantitative scores for alleles are just emerging from academia; the challenges of distributing the software packages and integrating their outputs into a production workflow are largely unsolved. For example, some software packages are only made available upon request from the original authors. However, systematizing the installation of such algorithms and the computation of their scores promises to enable many new approaches. A hot area of research right now is the development of machine learning methods, such as CADD, which can consider a large variety of functional scores in order to automatically predict a mutation’s effect.

The Annotations Service

I spent my internship at Counsyl building the annotations service, which

- manages all the third-party allele-scoring software packages, dubbed annotators
- calculates aggregate statistics on the annotations
- caches annotations in a database
- exposes an API and command line interface through which users and other software can query annotations for individual alleles and aggregate statistics

In turn, this service has allowed us to incorporate many scores into the curation workflow. One use of the annotations service is generating visualizations, such as histograms of scores for known benign and deleterious alleles (Figure 1). For example, by comparing a novel allele against previously curated alleles, we see that the PhyloP score suggests a deleterious classification. Additionally, this service allows us to easily explore and visualize many more scores as well as various combinations of them.

Figure 1. Distribution of PhyloP scores across Counsyl’s known deleterious (red) and benign (green) alleles. The score of a novel allele is indicated by a thin blue line.
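
The comparison behind Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the annotations service's actual code: the function names `empirical_percentile` and `histogram` are hypothetical, and every score below is invented for the example rather than taken from Counsyl data.

```python
from bisect import bisect_right


def empirical_percentile(scores, value):
    """Return the fraction of `scores` that are less than or equal to `value`."""
    ordered = sorted(scores)
    return bisect_right(ordered, value) / len(ordered)


def histogram(scores, lo, hi, bins):
    """Count scores in equal-width bins over [lo, hi]; `hi` falls in the last bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        if lo <= s <= hi:
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical PhyloP-like scores, invented for illustration only.
benign = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
deleterious = [2.1, 3.4, 4.0, 4.8, 5.5, 6.2]
novel = 5.0  # score of the allele being curated (also made up)

# Where does the novel score fall within each curated population?
benign_pct = empirical_percentile(benign, novel)
deleterious_pct = empirical_percentile(deleterious, novel)

# Bin counts that a plotting tool could draw as the two histograms.
benign_hist = histogram(benign, -2, 7, 3)
deleterious_hist = histogram(deleterious, -2, 7, 3)

print(f"novel score >= {benign_pct:.0%} of benign alleles")
print(f"novel score >= {deleterious_pct:.0%} of deleterious alleles")
```

With these made-up numbers, the novel score lies above every benign score and above four of the six deleterious ones, which is the kind of evidence that would point a curator toward a deleterious classification.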
Hybrid Schema/Schemaless Database

One of the challenges of collecting annotations from a diverse collection of annotators is finding a useful way to store their output that is

- flexible enough to support the output of all annotators
- efficient at finding all annotations for a particular allele
- efficient at executing arbitrary aggregate queries on the annotations
- able to store billions of annotations
- able to enforce uniqueness (there should only be one annotation per allele per annotator)

The annotations service satisfies all these requirements using a hybrid schema/schemaless database; the database engine enforces some constraints on each record (the presence of certain columns, the type of certain columns, uniqueness) while also allowing free-form data. This behavior is implemented by storing annotations in a PostgreSQL JSON column, a column type that accepts any JSON object, stores the object efficiently, and allows efficient queries to be executed on the object’s contents. Because it is just a PostgreSQL table, we can still define constraints and keys on all other columns as usual.

chromosome  offset    reference sequence  allele sequence  annotator  annotations
5           3342342   G                   T                VEP        {"PhyloP": 5.5, …}
13          32900276  G                   C                ESP        {"frequency": 0.0117}

Annotators Running in Docker Containers

Because deploying and running bleeding-edge academic annotators can be challenging, each executes in its own Docker container. Docker is a tool for managing Linux containers: isolated execution environments much like virtual machines but running in the host OS, sharing software packages when possible. Docker allows us great flexibility in configuring each annotator’s execution environment without sacrificing performance.

Big Picture

As the amount of DNA being sequenced increases, the bottleneck in many labs is becoming the interpretation of the sequences, especially the curation of never-before-seen alleles.
The more we move in the direction of quantitative analysis, the easier the job of the curators becomes: we can precisely define protocols based on the scores that reliably determine if an arbitrary mutation is benign or deleterious, reducing the need for curators to analyze data and do research by hand. Although “automatic” quantitative methods cannot replace human curators, the increasing power of these methods will help us keep pace and improve accuracy as the rate of sequencing at Counsyl increases.

Lucas Wojciechowski was a software engineering intern at Counsyl and is currently a Computer Science student at Waterloo.
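
As a footnote to the Hybrid Schema/Schemaless Database section, here is a minimal sketch of the idea of fixed, constrained columns sitting next to a free-form JSON payload. It makes some loud assumptions: SQLite (from Python's standard library) stands in for PostgreSQL so the example runs anywhere, the JSON is stored as text and decoded in Python rather than queried by the database engine, and the table name and the helper names `store` and `annotations_for` are hypothetical. The two rows reuse only the values shown in the table above.

```python
import json
import sqlite3

# SQLite is a stand-in here; the post describes a PostgreSQL JSON column.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotations (
        chromosome  TEXT    NOT NULL,
        "offset"    INTEGER NOT NULL,
        reference   TEXT    NOT NULL,
        allele      TEXT    NOT NULL,
        annotator   TEXT    NOT NULL,
        annotations TEXT    NOT NULL,  -- free-form JSON, one shape per annotator
        UNIQUE (chromosome, "offset", reference, allele, annotator)
    )
    """
)


def store(conn, chromosome, offset, reference, allele, annotator, payload):
    """Insert one annotation; the UNIQUE constraint rejects duplicates."""
    conn.execute(
        "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?)",
        (chromosome, offset, reference, allele, annotator, json.dumps(payload)),
    )


def annotations_for(conn, chromosome, offset, reference, allele):
    """Return {annotator: decoded JSON payload} for a single allele."""
    rows = conn.execute(
        "SELECT annotator, annotations FROM annotations "
        'WHERE chromosome = ? AND "offset" = ? AND reference = ? AND allele = ?',
        (chromosome, offset, reference, allele),
    )
    return {annotator: json.loads(doc) for annotator, doc in rows}


store(conn, "5", 3342342, "G", "T", "VEP", {"PhyloP": 5.5})
store(conn, "13", 32900276, "G", "C", "ESP", {"frequency": 0.0117})

# A second VEP annotation for the same allele violates the uniqueness rule.
try:
    store(conn, "5", 3342342, "G", "T", "VEP", {"PhyloP": 1.0})
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(annotations_for(conn, "5", 3342342, "G", "T"))
print("duplicate rejected:", duplicate_rejected)
```

The design point carries over even though the engine differs: the fixed columns get types, NOT NULL checks, and a uniqueness key, while each annotator is free to put whatever fields it produces into the JSON payload. In PostgreSQL, the JSON column type would additionally let the database itself filter and aggregate on the payload's contents.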