Bioinformatics - Tools
IGS has developed a number of tools for bioinformatics analyses that are available to the community as compiled binaries or as source code. Some of these include:
A program for analysis of protein functional divergence and prediction of molecular mechanisms.
Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.
Ergatis is a web-based utility used to create, run, and monitor reusable computational analysis pipelines, utilizing the Workflow engine. It contains pre-built components for common bioinformatics analysis tasks. Ergatis is under active development at IGS and is in use at several sequencing centers, including the J. Craig Venter Institute (JCVI) and the Broad Institute.
Manatee is a web-based tool used to perform manual functional annotation. It has been specifically designed to optimize the ability of curators to evaluate all available sequence-based and experimental data to assign the best possible annotation to a given gene product. Manatee allows users to view, modify, and store annotation through interactions with an underlying relational database where all of the information is stored. Manatee supports the storage of multiple types of functional annotation including protein names, gene symbols, EC numbers, Gene Ontology terms, and associated supporting evidence. In addition, Manatee provides summary views of statistics and information from the genome as a whole.
PhyloTrac is a software package for exploration and analysis of phylogenetic diversity from PhyloChip data. PhyloTrac is capable of displaying data from multiple PhyloChip experiments in a variety of styles, including heatmap, time series/parallel coordinates, probe intensity display, phylogenetic tree, and textual spreadsheets. All views are fully synchronized and dynamic so that selection and filtering in one view is instantaneously reflected in the other views.
Sybil is a web-based tool for visualizing and mining comparative genomic data. Powered by a Chado relational database, Sybil provides a rich set of interfaces for browsing and analyzing data. The tool has been implemented for a variety of organisms, both prokaryotes and eukaryotes. Sybil allows users to search for genes or gene clusters of interest and visualize their genomic context. The various displays provide multiple types of genomic comparisons for in-depth data mining, data interrogation from multiple angles, and generation of publication-ready figures. Sybil also gives users the ability to identify core and accessory genes from all or a subset of the available genomes. Most recently, a Sybil site has been released to the public for comparison of complete Streptococcus pneumoniae genomes. Strepneumo promises to be an important tool in accelerating vaccine discovery in developing nations.
Sybil is implemented in Perl and built on a tiered architecture that includes an API for retrieving data from Chado. The software also includes utilities for rendering publication-quality images in SVG and PDF formats. Sybil is open source and freely available, with documentation and demo databases provided for download.
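The core/accessory distinction mentioned above can be illustrated with a minimal Python sketch. This is not Sybil's code (Sybil is written in Perl and reads from Chado); the function name `split_core_accessory`, the cluster identifiers, and the genome names are all hypothetical. The sketch assumes each gene cluster is already known together with the set of genomes that contain a member of it.

```python
def split_core_accessory(clusters, genomes):
    """Split gene clusters into core and accessory for the selected genomes.

    clusters: dict mapping a cluster id to the set of genome names that
              contain at least one member of that cluster (hypothetical input).
    genomes:  the genomes to compare (all of them, or any subset).
    """
    selected = set(genomes)
    core, accessory = [], []
    for cluster_id, members in sorted(clusters.items()):
        present = members & selected
        if not present:
            # Cluster does not occur in any selected genome; ignore it.
            continue
        if present == selected:
            core.append(cluster_id)       # found in every selected genome
        else:
            accessory.append(cluster_id)  # found in only some of them
    return core, accessory


# Toy data; cluster and genome names are made up for illustration.
clusters = {
    "cluster_A": {"g1", "g2", "g3"},
    "cluster_B": {"g1", "g3"},
    "cluster_C": {"g2"},
}
core_all, accessory_all = split_core_accessory(clusters, ["g1", "g2", "g3"])
core_sub, accessory_sub = split_core_accessory(clusters, ["g1", "g3"])
print(core_all, accessory_all)
print(core_sub, accessory_sub)
```

Note that the result depends on the subset chosen: a cluster that is accessory across all genomes can become core when only some of them are selected.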
Workflow is a Java-based, XML-driven workflow engine suite that can be used to build, execute, and monitor complex process pipelines. This tool serves as the execution engine for Ergatis. Workflow is under active development at IGS.
The Phylomark tool takes a whole genome alignment and identifies the minimum number of smaller regions that carry a significant phylogenetic signal and can recapitulate the whole genome phylogeny. The tool is intended for screening large culture collections, so that whole genome sequencing efforts can be focused on isolates that represent new branches of the tree or that fill out regions poorly represented in the whole genome phylogeny.
ReVac: Reverse Vaccinology (A new tool for large-scale reverse vaccinology)
I have a long track record of developing computational and analytical approaches to support translational efforts aimed at developing new vaccines and therapeutics. In collaboration with the team of Dr. Rino Rappuoli at GlaxoSmithKline (formerly Chiron Vaccines and Novartis Vaccines and Diagnostics), I leveraged bacterial genome information to pioneer the Reverse Vaccinology approach, which identifies and prioritizes novel vaccine candidates using whole genome sequencing. We recently developed an updated reverse vaccinology pipeline, ReVac, that both runs a panoply of feature prediction programs without filtering out proteins and scores candidates based on predictions made on curated positive and negative control protein datasets. ReVac surveys several genomes, assessing protein conservation as well as DNA and protein repeats, which may result in variable expression of potential vaccine candidates (PVCs). ReVac’s orthologous clustering of conserved genes identifies core and dispensable genome components, and its scoring scheme ranks PVCs for subsequent experimental testing. Application of ReVac to two COPD pathogens prioritized both novel and previously validated PVCs.
[PI: Hervé, collaborator: Tim Murphy again]
TwinBLAST was developed to allow for easy visual inspection of two BLAST reports simultaneously.
LGTSeek can be used to detect the recent integration of DNA from paired-end Illumina data, given some knowledge about the donor and/or recipient genome.
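One common way to use paired-end data for this kind of detection is to look for read pairs whose two mates map to different genomes. The Python sketch below illustrates only that general idea and is not LGTSeek's implementation; the input format, the function name `find_split_pairs`, and the labels "donor" and "recipient" are assumptions made for the example. It assumes each mate has already been mapped against both references.

```python
def find_split_pairs(pairs):
    """Return ids of read pairs with one mate on each genome.

    pairs: iterable of (pair_id, mate1_hits, mate2_hits), where each
           *_hits value is the set of references the mate mapped to,
           drawn from {"donor", "recipient"} (hypothetical input format).
    """
    candidates = []
    for pair_id, mate1_hits, mate2_hits in pairs:
        donor_then_recipient = (
            mate1_hits == {"donor"} and mate2_hits == {"recipient"}
        )
        recipient_then_donor = (
            mate1_hits == {"recipient"} and mate2_hits == {"donor"}
        )
        # Keep only pairs where each mate maps unambiguously to one genome
        # and the two mates disagree; these span a putative junction.
        if donor_then_recipient or recipient_then_donor:
            candidates.append(pair_id)
    return candidates


# Toy mapping results; pair ids are made up for illustration.
example_pairs = [
    ("pair1", {"donor"}, {"recipient"}),           # spans a junction
    ("pair2", {"recipient"}, {"recipient"}),       # ordinary recipient pair
    ("pair3", {"donor", "recipient"}, {"donor"}),  # ambiguous mate, skipped
    ("pair4", {"recipient"}, {"donor"}),           # spans a junction
    ("pair5", set(), {"donor"}),                   # unmapped mate, skipped
]
split_ids = find_split_pairs(example_pairs)
print(split_ids)
```

Requiring each mate to map to exactly one reference is a deliberately strict choice in this sketch; it discards reads from regions conserved between the two genomes, which would otherwise produce false candidates.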
FADU - Feature Aggregate Depth Utility
IDEA - Interactive Display for Evolutionary Analyses
IDEA (Interactive Display for Evolutionary Analyses) provides a graphical interface for PAML (Phylogenetic Analysis by Maximum Likelihood), a suite of programs for conducting molecular evolution analyses on nucleotide and amino-acid data. IDEA allows you to run either of the PAML programs, codeml or baseml, on one or more datasets simultaneously to obtain maximum likelihood estimates of numbers of substitutions per branch and per site and to compare multiple models of molecular evolution. IDEA runs on Linux, Solaris and Mac OS X operating systems; it is designed to execute processes in parallel on a multiprocessor machine and can run on a computational grid with support for SGE or Condor. IDEA is available free of charge from SourceForge. Citation: Egan et al. 2008.
ETHA – Exon-Targeted Hybrid Assembly
ETHA (Exon-Targeted Hybrid Assembly) is a software package that allows the user to validate the assembly of Plasmodium falciparum var genes from Illumina and relatively short-length PacBio sequencing data. It is designed to prevent possible sequencing errors and assembly chimeras in the (extracellular) exon 1 of var genes by validating only var sequences that are supported by k-mer walks composed of highly trusted k-mers identified in Illumina data. ETHA is available from SourceForge. Citation: Dara et al. 2017.
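The idea of a k-mer walk over trusted k-mers can be sketched in a few lines of Python. This is a simplified illustration, not ETHA's implementation: the function names, the count threshold used to define a "trusted" k-mer, and the toy reads are all assumptions, and the sketch ignores reverse complements and base qualities.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the given short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trusted_kmers(counts, min_count):
    """Treat k-mers seen at least min_count times as trusted (assumed rule)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def is_supported(candidate, trusted, k):
    """True if every consecutive k-mer along the candidate is trusted."""
    if len(candidate) < k:
        return False
    return all(
        candidate[i:i + k] in trusted
        for i in range(len(candidate) - k + 1)
    )


# Toy short reads and candidate sequences, made up for illustration.
short_reads = ["ACGTAC", "ACGTAC", "CGTACG", "CGTACG"]
k = 4
trusted = trusted_kmers(count_kmers(short_reads, k), min_count=2)

good = is_supported("ACGTACG", trusted, k)     # walk never leaves trusted set
chimera = is_supported("ACGTTCG", trusted, k)  # "CGTT" is absent from reads
print(good, chimera)
```

A candidate that contains a sequencing error or a chimeric junction introduces k-mers that are rare or absent in the short-read data, so the walk breaks at that position and the candidate is rejected.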
ISCA - In Silico Capture and Assembly
We are developing a new pipeline, called ISCA (In Silico Capture and Assembly), to perform sequence reconstruction of targeted loci based on in silico capture and assembly of short-read whole genome sequence data. ISCA is currently under development.