LEE LICHTENSTEIN¹, JONN SMITH¹, DAVID BENJAMIN¹, AARON CHEVALIER¹, KRISTIAN CIBULSKIS¹, JULIAN HESS¹, SAMUEL K. LEE¹, IGNATY LESHCHINER¹, DIMITRI LIVITZ¹, DANIEL ROSEBROCK¹, VALENTIN RUANO-RUBIO¹, TAKUTO SATO¹, ANDREY SMIRNOV¹, CHIP STEWART¹, GAD GETZ¹,², ERIC BANKS¹
¹Broad Institute of MIT and Harvard, Cambridge, MA
Abstract
Small somatic mutations (SNVs and indels) and copy number alterations are the two categories of mutations with the largest impact on tumors. The Broad Institute has released somatic variant calling workflows for small mutations (M2) and copy number alterations (ModelSegments), both based on the Genome Analysis Toolkit (GATK). The workflows call variants in capture or whole-genome sequencing data and include functional annotations (Funcotator), such as protein change (for small variants) and impacted gene (for all variants). Common artifacts in sequencing data, such as those arising from oxidative DNA damage, FFPE/deamination, or mapping errors, are corrected automatically. Evaluation of the workflows is standardized and repeatable, which allows tracking of both detection performance (e.g. sensitivity, precision) and runtime performance (e.g. CPU and RAM usage) across versions. A matched normal is not required for a given tumor sample, since the workflows can leverage pre-processed panels of normals (PoNs). The workflows are freely available, portable (i.e. they can be run on local, on-premises, or cloud compute), optimized for cost reduction, and tunable to make the best use of available compute.
The measured sensitivity of M2 was at least 0.93 for somatic single-nucleotide variants (SNVs) and 0.83 for small insertions/deletions (indels) on the DREAM1, DREAM2, and DREAM3 challenges and on a titrated mixture of germline samples (>=100x depth, allele fraction (AF) = 0.2). The measured precision of M2 ranged from 0.91 to 0.98 on DREAM1, DREAM2, and DREAM3 for both SNVs and indels. The false positive rate (FPR) of M2 was between 0.03 and 0.21 FP/Mb for SNVs, and between 0.0 and 0.1 FP/Mb for indels, on twelve paired, replicate normal-normal samples. The M2 workflow cost about USD 1.15 for a pair of 35x WGS matched tumor-normal samples on Google Cloud Compute and required about 32 hours of CPU time on a single core with 3 GB of RAM.
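As a reading aid for the detection metrics quoted for M2, the following is a minimal Python sketch of the standard definitions of sensitivity, precision, and false positives per megabase. It is not the evaluation code used for this work: the `CallCounts` type, the function names, and the example counts are hypothetical, and the sketch assumes that calls have already been matched against a truth set.

```python
from dataclasses import dataclass


@dataclass
class CallCounts:
    """Counts from comparing a call set against a truth set (hypothetical type)."""
    true_positives: int
    false_positives: int
    false_negatives: int


def sensitivity(counts: CallCounts) -> float:
    # Fraction of true variants that were called: TP / (TP + FN).
    return counts.true_positives / (counts.true_positives + counts.false_negatives)


def precision(counts: CallCounts) -> float:
    # Fraction of calls that are true variants: TP / (TP + FP).
    return counts.true_positives / (counts.true_positives + counts.false_positives)


def false_positives_per_mb(false_positives: int, evaluated_bases: int) -> float:
    # In a normal-normal comparison no somatic variants are expected,
    # so every call counts as a false positive. The rate is the number
    # of calls per megabase of evaluated territory.
    return false_positives / (evaluated_bases / 1_000_000)


# Made-up counts for illustration only; these are not results from this work.
example = CallCounts(true_positives=90, false_positives=10, false_negatives=10)
print(sensitivity(example), precision(example))
print(false_positives_per_mb(false_positives=5, evaluated_bases=50_000_000))
```

Reporting false positives per megabase, rather than a raw count, makes normal-normal results comparable between capture and whole-genome data, where the evaluated territory differs by orders of magnitude.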
The measured sensitivity of ModelSegments was at least 0.91 for deletions and amplifications across three cohorts of TCGA whole-exome samples (stomach adenocarcinoma, N=39; thyroid carcinoma, N=50; lung adenocarcinoma, N=60). The measured specificity for the same cohorts was at least 0.96 for both deletions and amplifications. All ModelSegments results reported here used the corresponding SNP array results as the truth set.
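The sensitivity and specificity figures above can be illustrated with a small Python sketch. The text does not say how ModelSegments calls were matched to the SNP array truth set, so this sketch assumes a simple scheme in which each target carries one label from the caller and one from the truth set. The function name, the label strings, and the example data are all hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    calls: Sequence[str], truth: Sequence[str], event: str
) -> Tuple[float, float]:
    """Score one event type (e.g. "del" or "amp") against per-target truth labels.

    Assumes both sequences are aligned target by target, and that the truth
    set contains at least one target with the event and one without it.
    """
    if len(calls) != len(truth):
        raise ValueError("calls and truth must cover the same targets")

    tp = fn = fp = tn = 0
    for called, expected in zip(calls, truth):
        is_called = called == event
        is_true = expected == event
        if is_true and is_called:
            tp += 1
        elif is_true:
            fn += 1
        elif is_called:
            fp += 1
        else:
            tn += 1

    # Sensitivity: TP / (TP + FN). Specificity: TN / (TN + FP).
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels for illustration only; these are not results from this work.
truth_labels = ["del", "del", "neutral", "amp", "neutral", "neutral"]
call_labels = ["del", "neutral", "neutral", "amp", "amp", "neutral"]

print(sensitivity_specificity(call_labels, truth_labels, "del"))
print(sensitivity_specificity(call_labels, truth_labels, "amp"))
```

Scoring deletions and amplifications separately, as in the two calls at the end, mirrors how the figures are reported per event type.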
The GATK ModelSegments workflow cost approximately USD 0.65 for a 30x WGS pair on Google Cloud Compute and required about 6 hours of CPU time on a single core. RAM usage was varied automatically by the workflow to minimize cost and stayed in the range of 2-13 GB.