Bioinformatics - Pipelines

IGS Prokaryotic Annotation Pipeline

IGS has developed a comprehensive automated annotation pipeline for use with Bacteria and Archaea (Galens et al., PMID:21677861). The pipeline predicts protein-coding genes as well as non-coding RNAs. Similarity evidence is collected for predicted proteins with a variety of methods, including pairwise alignments, HMM searches, and multiple motif prediction tools. A hierarchical rule-based system assigns annotation to each protein based on the highest-quality evidence available. Results are loaded into a relational database and can be viewed using the Manatee annotation visualization and curation tool. Results are also available in multiple standard flat file formats.
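The idea of assigning annotation from the highest-quality evidence can be illustrated with a minimal Python sketch. This is not the pipeline's actual rule set: the tier names in `EVIDENCE_RANK`, the `Evidence` fields, and the example hits are all hypothetical and chosen only to show the mechanism.

```python
from dataclasses import dataclass

# Hypothetical evidence tiers, best first. The real pipeline's rule order
# is not described here; these names are placeholders for illustration.
EVIDENCE_RANK = ["hmm", "pairwise_alignment", "motif"]


@dataclass
class Evidence:
    kind: str          # one of EVIDENCE_RANK
    product_name: str  # annotation proposed by this piece of evidence
    score: float       # higher is better within a tier


def assign_annotation(evidence, default="hypothetical protein"):
    """Return the product name from the best-ranked tier that has evidence."""
    for kind in EVIDENCE_RANK:
        tier = [e for e in evidence if e.kind == kind]
        if tier:
            # Within a tier, take the highest-scoring piece of evidence.
            return max(tier, key=lambda e: e.score).product_name
    return default


# Illustrative input only; the names and scores are made up.
hits = [
    Evidence("motif", "putative membrane protein", 12.0),
    Evidence("pairwise_alignment", "DNA gyrase subunit A", 950.0),
]
annotation = assign_annotation(hits)
```

In this sketch a protein with no evidence at all falls back to a default name, and a lower tier is consulted only when every higher tier is empty.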

Transcriptome Analysis Pipeline

This pipeline includes alignment of reads to a reference genome, RPKM analysis, differential expression analysis, isoform analysis, and differential isoform analysis. We can also perform de novo transcriptome assembly. Results are output as spreadsheets containing statistics, differentially expressed genes, isoforms, and differentially expressed isoforms, along with PDF plots and figures. Visualization tools such as the Integrative Genomics Viewer (IGV) can be used to view the results.
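For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the standard definition can be written as a short Python function. The function name, argument names, and the example numbers are illustrative assumptions, not part of the pipeline's code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene in a library of
# 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
```

Normalizing by both gene length and library size is what makes values comparable across genes and across samples of different sequencing depth.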

Comparative genomics using protein clusters

This pipeline uses Jaccard-filtered bidirectional best BLAST matches to produce ortholog clusters (Crabtree et al., PMID:18314579). It has been used successfully to compare 100 or more genomes at one time. The web-based visualization tool Sybil is used to search and view ortholog clusters, genomic context, synteny, and more.
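The two ingredients named above, the Jaccard coefficient and bidirectional best hits, can be sketched in a few lines of Python. This is a simplified illustration rather than the published algorithm: the way the filter is applied here (comparing the hit sets of the two proteins in each pair), the `cutoff` value of 0.6, and all identifiers and data are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


def jaccard_filter(pairs, hit_sets, cutoff=0.6):
    """Keep pairs whose hit sets overlap with a Jaccard coefficient >= cutoff.

    The cutoff and the use of hit sets are assumptions for this sketch.
    """
    return [
        (a, b)
        for a, b in pairs
        if jaccard(hit_sets.get(a, ()), hit_sets.get(b, ())) >= cutoff
    ]


# Made-up best-hit tables for two genomes, A and B.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a2"}
# Made-up sets of proteins that each protein matches.
hit_sets = {
    "a1": {"a1", "b1", "c1"},
    "b1": {"a1", "b1", "c1"},
    "a2": {"a2", "b2"},
    "b2": {"b2", "x", "y", "z"},
}

pairs = bidirectional_best_hits(best_a_to_b, best_b_to_a)
kept = jaccard_filter(pairs, hit_sets)
```

Here `a3` is dropped because its best hit `b2` does not point back to it, and the pair (`a2`, `b2`) is dropped by the filter because the two proteins share few matches.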

Comparative genomics using Mugsy

This method employs the Mugsy whole genome alignment algorithm (Angiuoli et al., PMID:21148543). Mugsy is a reference-independent tool that builds protein ortholog groups based on whole genome multiple alignments and synteny, which helps to differentiate between paralogs and orthologs. This method is optimized for comparing closely related organisms. The web-based visualization tool Sybil is used to search and view ortholog clusters, genomic context, synteny, and more.


Gemina

Gemina (Genomic Metadata for Infectious Agents) is an open-source, web-based, pathogen-centric tool designed to provide targeted DNA signature selection for the NIAID category A-C viral and bacterial pathogens. A representative genomic sequence is identified for each pathogen by the Gemina system and used by the Insignia DNA Signature pipeline.

The Gemina system describes the Who [Host], What [Disease, Symptom], When [Date], Where [Location] and How [Pathogen, Environmental Source, Reservoir, Transmission Method] of infectious pathogens.

The Gemina system provides an integrated investigative and geospatial surveillance system that connects pathogens, pathogen products, and disease. It is anchored on the taxonomic IDs of the pathogen and host, and for the first time links a unique genomic representation of each pathogen with ontology-regularized metadata for the associated epidemiological information. The system has been developed with a straightforward text-based query interface, a Java-based ontology tree viewer for deeper exploration of the ontologies, geospatial surveillance functionality for viewing the progression of pathogens across space and time, and a selection tool for DNA signatures. Together these provide a set of resources for pathogen surveillance, metadata investigation, and DNA diagnostics of the NIAID category A-C bacterial and viral pathogens. The Gemina web interface provides access to data extracted from PubMed articles for these pathogens through a set of metadata controlled vocabularies for Toxins, Reservoirs, Environmental Sources (EnvO), Geographic Locations (Gaz), Diseases, Anatomy, Transmission Methods, and Symptoms. This strategy allows users to build a multi-term query using one or more metadata types representing the Gemina chain of infection data model.
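A multi-term query over metadata types can be pictured with a small Python sketch. It does not reflect Gemina's actual schema or interface: the field names, the `multi_term_query` function, and the placeholder records are hypothetical, and the records contain no real epidemiological data.

```python
# Placeholder records; every value is invented for illustration.
records = [
    {"pathogen": "pathogen_1", "host": "host_A", "disease": "disease_X",
     "location": "location_1", "transmission": "method_1"},
    {"pathogen": "pathogen_2", "host": "host_A", "disease": "disease_Y",
     "location": "location_1", "transmission": "method_2"},
    {"pathogen": "pathogen_3", "host": "host_B", "disease": "disease_X",
     "location": "location_2", "transmission": "method_1"},
]


def multi_term_query(records, **terms):
    """Return records that match every given metadata term (logical AND)."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in terms.items())
    ]


# One metadata type, then two combined.
by_host = multi_term_query(records, host="host_A")
by_host_and_disease = multi_term_query(records, host="host_A", disease="disease_X")
```

Adding a term narrows the result set, which mirrors how combining metadata types (for example host plus disease) focuses a search along the chain of infection.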

The Gemina system enables users to explore the diversity of outbreak data reported in the literature for each NIAID category A-C pathogen, regularized through a set of mature community-adopted ontologies. Users can identify the breadth of hosts and diseases known for these pathogens, see where in the world these pathogens have been reported, and link to the Insignia Signature Detection tool to identify unique regions within their genomes. In the 06/23/2009 release of Gemina, the database contained 367 bacterial strains, 21 toxin strains, and 10,991 viral strains, including Influenza A, B, and C subtypes and strains such as the 2009 Swine Flu H1N1 strains.