Informatics Resource Center - Information Technology
To enable our research activities, IGS has created a state-of-the-art computational infrastructure that includes a computational grid, a 10 Gbps network infrastructure, high-performance database servers, and a tiered storage management system.
IGS is connected to the rest of the campus by a high-performance switched 10 Gbps network powered by Cisco equipment. All UMB buildings are connected to the LAN backbone and core switches via fiber cabling. IGS, as part of UMB, maintains a 10 Gbps connection via the National LambdaRail (NLR) to other NLR sites, a 1 Gbps connection to Internet2 (the high-speed network designed to facilitate collaboration and communication among research institutions), and an aggregate bandwidth of 20 Mbps to the commodity Internet.
The computational grid is built around six high-performance, high-memory multi-processor machines (64-512 GB RAM, 4-8 multi-core CPUs) for memory- and compute-intensive applications such as genome and transcriptome assembly and multiple genome alignment, and over ninety high-throughput computational nodes (16-48 GB RAM, dual multi-core Intel Xeon processors) for running distributed applications such as BLAST and HMMER searches. Grid scheduling is managed by the Sun N1 Grid Engine (SGE) distributed resource management system.
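Distributed searches of this kind are typically submitted to SGE as array jobs, with each task processing one chunk of the input data. The Python sketch below illustrates the general pattern only; the file paths, BLAST database location, and parallel environment name are hypothetical placeholders rather than actual IGS configuration.

    # Illustrative sketch only: submit a distributed BLAST search to a Grid Engine
    # cluster as an array job.  The paths, database location, and parallel
    # environment name are hypothetical placeholders, not IGS configuration.
    import subprocess

    N_CHUNKS = 90                                  # query FASTA pre-split into 90 chunks (assumed)
    CHUNK_DIR = "/local/scratch/query_chunks"      # hypothetical NFS-visible working directory
    BLAST_DB = "/local/db/nr"                      # hypothetical BLAST database path

    # Each array task selects its own chunk via $SGE_TASK_ID, which Grid Engine
    # sets in the task's environment and /bin/sh expands on the compute node.
    task_cmd = (
        f"blastp -query {CHUNK_DIR}/chunk_$SGE_TASK_ID.fasta "
        f"-db {BLAST_DB} -evalue 1e-5 -num_threads 2 "
        f"-out {CHUNK_DIR}/chunk_$SGE_TASK_ID.blastp"
    )

    qsub_cmd = [
        "qsub",
        "-N", "blastp_array",        # job name
        "-t", f"1-{N_CHUNKS}",       # array job: one task per query chunk
        "-pe", "thread", "2",        # parallel environment name is site-specific (assumed)
        "-l", "mem_free=4G",         # per-task memory request
        "-cwd", "-V",                # run in the current directory, export environment
        "-b", "y",                   # submit the command line directly, without a wrapper script
        "/bin/sh", "-c", task_cmd,
    ]

    subprocess.run(qsub_cmd, check=True)

Because every task derives its input from SGE_TASK_ID, one submission spreads the search across the available high-throughput nodes without per-task scripting.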
To address the ever-expanding data sets generated by next-generation genome sequencing technologies at a reasonable cost, we have deployed a tiered storage infrastructure consisting of three tiers of random-access storage and a fourth tier of serial-access tape media for archival and data backup. Tier 1 storage is high-performance grid-attached storage where most computational activity occurs; it is built around a scalable, high-performance global file system from EMC/Isilon with a capacity of over 100 TB that currently supports over 1 GB/s of throughput. Tier 2 storage is built around a scalable, high-capacity global file system from EMC/Isilon with a capacity of over 400 TB that currently supports 600 MB/s of throughput. Tier 3 storage is a scalable, high-performance parallel file system from Panasas with a capacity of 450 TB that can currently support an aggregate throughput of over 3 GB/s and is used for archival. Tier 4 storage is off-line storage used for data backups as well as for archival data such as the raw data generated by sequencers; it is built around a tape library and is integrated with Tier 1/2/3 storage to provide daily, weekly, monthly, and annual backups.
IGS has built a high-availability (HA) web infrastructure centered on clustered Apache web servers and load balancers, which ensures minimal downtime and uninterrupted access to our site and data. The IGS bioinformatics group supports MySQL and PostgreSQL databases. All database servers are attached to high-performance SAN-attached disk systems to ensure speed as well as expandability, and they are accessible from all desktops and servers at IGS. All storage is accessible from desktops as well as computational servers via the Network File System (NFS) or similar protocols, hosted by high-performance redundant file servers from Panasas, NetApp, and EMC.
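As a concrete illustration of this access model, the Python sketch below queries a centrally hosted PostgreSQL server from a desktop or compute node; the host name, database, account, and table are hypothetical placeholders rather than actual IGS resources.

    # Illustrative sketch only: query a centrally hosted PostgreSQL server from an
    # IGS desktop or compute node.  The host name, database name, user, and table
    # are hypothetical placeholders, not actual IGS resources.
    import psycopg2

    conn = psycopg2.connect(
        host="dbhost.example.org",   # hypothetical central database server
        dbname="annotation",         # hypothetical database
        user="analyst",              # hypothetical read-only account
    )
    try:
        with conn.cursor() as cur:
            # Query a hypothetical gene-annotation table; parameters are passed
            # separately so the driver handles quoting.
            cur.execute(
                "SELECT gene_id, product FROM genes WHERE organism = %s LIMIT 10",
                ("Escherichia coli",),
            )
            for gene_id, product in cur.fetchall():
                print(gene_id, product)
    finally:
        conn.close()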
In addition to the IT infrastructure dedicated to IGS use, we were recently awarded an NSF MRI-R2 grant to set up a shared computational infrastructure, the Data Intensive Academic Grid (DIAG). The DIAG will meet the bioinformatics needs of over 30 researchers at 6 universities, including the University of Maryland School of Medicine. The DIAG includes a computational infrastructure, a high-performance storage network, and optimized data sets generated by mining public data repositories such as CAMERA and NCBI to enable researchers to perform analyses.
The DIAG computational infrastructure includes 1,500 cores (125 nodes) for high-throughput computational analysis and 160 cores (5 nodes) for high-performance computational analysis. To handle the large amounts of data generated by sequencing experiments, it includes approximately 600 terabytes (TB) of clustered, high-performance parallel file system storage.
The bioinformatics community will access the DIAG through Ergatis, a web-based pipeline creation and management tool; through bioinformatics-oriented virtual machines (CloVR, an NSF-supported project); and through interactive and programmatic access using technologies such as Nimbus and the Virtual Data Toolkit from the Open Science Grid (NSF-supported projects). These will allow users to build applications that co-locate custom genomic data sets with computational resources to perform analyses previously too difficult for groups with limited informatics support.
At IGS, over 100 commonly used bioinformatics tools are installed centrally and kept up to date. All IGS employees and collaborators have secure external access to this infrastructure via a virtual private network (VPN) powered by Cisco equipment.