IT
The Institute's equipment can generate gigabytes of complex data and process it relatively quickly. To support this "next generation" of bioinformatics data, IGS has created a state-of-the-art computational infrastructure that includes a computational grid, a 10 Gbps network infrastructure, large-scale database servers, and a hierarchical storage management system.
IGS is connected by a high-performance switched gigabit network, powered by Cisco equipment, to the rest of the University of Maryland Baltimore (UMB) campus. All UMB buildings are connected to the LAN backbone and core switches via fiber cabling. IGS, as part of UMB, maintains a 45 Mbps (1 Gbps capable) connection to Internet2, the high-speed network designed to facilitate collaboration and communication among research institutions, as well as an aggregate bandwidth of 20 Mbps to the regular Internet. UMB and IGS have also recently joined the National Lambda Rail (NLR), the ultra-high-performance network infrastructure designed to enable bandwidth-intensive distributed research projects.
The computational grid is built primarily around a collection (100 cores) of high-performance, high-memory multiprocessor machines (64-128 GB RAM, 4-8 multi-core CPUs) for memory- and compute-intensive applications such as assembly and multiple genome alignment, and a larger number (400 cores) of high-throughput computational nodes (machines with 16 GB RAM and 2 multi-core Intel Xeon CPUs) for running distributed applications such as BLAST and HMMsearch. Grid scheduling is managed by the Sun Grid Engine (SGE) distributed computing system.
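As an illustration of how a distributed application might be spread over the high-throughput nodes, the following minimal Python sketch splits a set of query sequences into chunks and builds the command line for an SGE array job. It is a sketch under stated assumptions, not a description of the IGS job setup: the script name `run_blast_chunk.sh`, the job name, and the chunk size are hypothetical. The `qsub` options used (`-N`, `-cwd`, `-t`) are standard SGE options, and each array task would identify its own chunk from the `SGE_TASK_ID` environment variable that SGE sets.

```python
import math


def build_qsub_command(n_sequences, chunk_size,
                       script="run_blast_chunk.sh",   # hypothetical wrapper script
                       job_name="blast_array"):       # hypothetical job name
    """Return the argument list for an SGE array job covering all sequences."""
    if n_sequences <= 0 or chunk_size <= 0:
        raise ValueError("n_sequences and chunk_size must be positive")
    n_tasks = math.ceil(n_sequences / chunk_size)
    # -N: job name, -cwd: run in the current directory, -t: array task range
    return ["qsub", "-N", job_name, "-cwd", "-t", f"1-{n_tasks}", script]


def chunk_bounds(task_id, n_sequences, chunk_size):
    """Half-open [start, end) sequence range for a 1-based SGE task id."""
    start = (task_id - 1) * chunk_size
    end = min(start + chunk_size, n_sequences)
    return start, end


if __name__ == "__main__":
    # Assumed workload: 10,000 query sequences in chunks of 25 gives 400 tasks,
    # one per core of the high-throughput partition described above.
    print(" ".join(build_qsub_command(10_000, 25)))
    print(chunk_bounds(400, 10_000, 25))
```

The sketch only constructs the command; submitting it would require an SGE client and a real wrapper script that reads `SGE_TASK_ID` and runs the tool on the corresponding chunk.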
To address the ever-expanding data sets generated by next-generation genome sequencing technologies at a reasonable cost, we have deployed a hierarchical storage infrastructure consisting of three tiers of random-access storage and a fourth tier of serial-access tape storage for archival and data backup. Tier 1 storage is grid-attached storage with a capacity of 16 TB, where most of the computational activities will occur. This tier can currently support over 400 MB/s of throughput and can be scaled easily, at nominal additional cost, to support more. Tier 2 storage is high-performance storage that hosts mission-critical data for active projects and has a current capacity of 40 TB with 125 MB/s I/O throughput. Tier 3 storage is near-line storage used for hosting less frequently used data that still needs random access. This tier currently has a capacity of 64 TB with an aggregate throughput of 35 MB/s. Tier 4 storage is off-line storage that is used for data backups as well as archival data such as the raw data generated by sequencers. This tier is built around a tape library and is integrated with Tier 2 and Tier 3 storage to provide daily, weekly, monthly, and annual backups.
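To put the tier figures side by side, the following minimal Python sketch records the capacities and throughputs quoted above and computes the total random-access capacity, along with how long it would take to stream each tier once at its quoted rate. The class and function names are illustrative, decimal units are assumed (1 TB = 1,000,000 MB), and Tier 4 is omitted because no figures are given for it. Since Tier 1 is quoted as "over 400 MB/s", its time is an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    capacity_tb: float       # capacity as quoted in the text
    throughput_mb_s: float   # throughput as quoted in the text


TIERS = [
    Tier("Tier 1 (grid-attached)", 16, 400),
    Tier("Tier 2 (high-performance)", 40, 125),
    Tier("Tier 3 (near-line)", 64, 35),
]


def full_scan_hours(tier):
    """Hours needed to stream the whole tier once at its quoted throughput."""
    seconds = tier.capacity_tb * 1_000_000 / tier.throughput_mb_s
    return seconds / 3600


def total_capacity_tb(tiers):
    """Combined capacity of the given tiers in TB."""
    return sum(t.capacity_tb for t in tiers)


if __name__ == "__main__":
    for tier in TIERS:
        print(f"{tier.name}: {tier.capacity_tb} TB, "
              f"{full_scan_hours(tier):.1f} h to stream once")
    print(f"Total random-access capacity: {total_capacity_tb(TIERS)} TB")
```

The widening gap between capacity and throughput from Tier 1 to Tier 3 reflects the design described above: compute-heavy work stays on the fastest tier, while less frequently used data moves to larger, slower storage.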
IGS is in the process of building a high-availability (HA) web infrastructure centered on clustered Apache web servers. This will ensure minimal downtime and uninterrupted access to our site and data. The same high-availability approach is being adopted for databases, where we are in the process of deploying clustered Oracle database servers for internal as well as external use. In addition to Oracle, IGS supports MySQL and PostgreSQL so that we can deploy other open source and commercial tools that specifically target these database engines.
All database servers are attached to high-performance SAN-attached disk systems to ensure speed as well as expandability, and are accessible from all desktops and servers at IGS. All of this storage is accessible from desktops and computational servers via the Network File System (NFS) protocol, served by high-performance redundant NFS gateways from NetApp. IGS has over 100 commonly used bioinformatics tools installed centrally and kept up to date. All IGS employees and collaborators have secure external access to this infrastructure via a virtual private network (VPN) powered by Cisco equipment.