
Schedule for 2012:

Spring 2012 topics will primarily focus on "Big Data", interspersed with talks on social coding and experimental reproducibility.








Feb 8

Anjul Bhambhri, VP for Big Data, IBM


Big Data and Better Business Outcomes

Nirav Merchant


Organizations today want to tap into the wealth of information hidden in the data around them to improve competitiveness, efficiency and profitability. Huge volumes of data are created every day from a variety of sources including: sensors, smart devices, social media and billions of Internet and smart phone users worldwide. The challenge is storing, managing and deriving just-in-time insights from this data, while preserving and using existing information management investments. The Big Data challenge is pervasive across the majority of industries including: finance, government, telecommunications, retail, healthcare, energy and utilities. 

This presentation will look at emerging technologies (Watson), evolving roles (data scientists) and how technology can drive better business outcomes. Only by integrating Big Data and applying context, patterns and intelligence can organizations drive new business efficiencies.

Feb 22

Zachary Simpson


Happy Science Coding

Nirav Merchant

Zack Booth Simpson, Fellow of the Institute for Cellular and Molecular Biology at UT Austin, will demonstrate a new online development platform aimed at scientists called "Happy Science Coding". The web-based platform permits social coding with online documents editable from the browser, simple automatic version control, and an open, server-based execution model that permits any computer in the cloud to serve as an execution machine. A model for maintaining development state will be proposed that would enable scientists to mark their code and dependency state for publication, ensuring that others would be able to run the code in perpetuity.


March 14

Michael Schatz, CSHL


Entering the era of mega-genomics

Nirav Merchant

The continuing revolution in DNA sequencing and biological sensor technologies is driving a digital transformation to our approaches for observation, experimentation, and interpretation that form the foundation of modern biology and genomics. Whereas classical experiments were limited to thousands of hand-collected observations, today’s improved sensors allow billions of digital observations and are improving at an exponential rate that exceeds Moore’s law. These improvements have made it possible to sequence new genomes and monitor the dynamics of biological processes on an unprecedented “mega-scale”, but have brought proportionally greater quantitative and computational requirements.

The growing digital demands have motivated extensive research into computational algorithms and parallel systems for analysis. Recently a great deal of research has been focused on applying emerging scalable computing systems to genomic research. One of the most promising is the Hadoop open-source implementation of MapReduce: it is specifically designed to scale to very large datasets, its intuitive design supports rich parallel algorithms, and it is naturally applied to the analysis of many biological assays. During my presentation, I will describe some recent innovations using these and other technologies for large-scale genome assembly, variation detection, and transcription analysis. These are promising early results, but continued research is essential in the coming years, especially as we hope to model and mine these data to uncover genotype-to-phenotype relations that can only be detected across very large populations.
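For readers unfamiliar with the MapReduce pattern mentioned above, here is a minimal single-machine sketch in Python that counts k-mers across sequencing reads. It is an illustration only, not code from the talk and not Hadoop itself: the function names (map_kmers, shuffle, reduce_counts, count_kmers) and the example reads are hypothetical, and a real Hadoop job would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-length window in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in grouped.items()}


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence on an in-memory list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical reads, chosen only to make the example easy to follow.
    example_reads = ["ACGTAC", "CGTACG"]
    print(count_kmers(example_reads, 3))
```

Because each read is mapped independently and each k-mer is reduced independently, both steps can in principle be spread over many workers, which is the property that makes the pattern attractive for very large read sets.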

March 28

Paul Brown


SciDB: Large Scale Array Data Management

Nirav Merchant

In this talk you will learn about the SciDB big data storage and analytic platform. In contrast to big data approaches associated with Map/Reduce technologies and inspired by the requirements of web log analysis, SciDB builds on ideas and methods developed over many years to cope with the challenges of large scale scientific data analysis. SciDB is a transactional DBMS that provides its users with a declarative query language built on top of an array data model. Building on its extensibility and MPP foundations, SciDB supports a wide range of statistical and data processing functionality in a similar fashion to ScaLAPACK. SciDB is a completely new implementation that takes advantage of the intrinsic ordering of array data to deliver superior scalability and performance over complex, real-world workloads.

April 11

Jeff Kantor, LSST


An Overview of the LSST Data Management System

Andrew Lenards

The LSST Data Management System (DMS) processes the incoming stream of images that the camera system generates to produce transient alerts and to archive the raw images, periodically creates new calibration data products that other processing functions will use, creates and archives an annual Data Release (a static self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for the Data Release), and makes all LSST data available through an interface that uses community-based standards and facilitates user data analysis and production of user-defined data products with supercomputing-scale resources.

This paper discusses DMS distributed processing and data, and DMS architecture and design, with an emphasis on the particular technical challenges that must be met. The DMS publishes transient alerts in community-standard formats (e.g. VOEvent) within 60 seconds of detection. The DMS processes and archives over 50 petabytes of exposures (over the 10-year survey). Data Releases include catalogs of tens of trillions of detected sources and tens of billions of astronomical objects, 2000-deep co-added exposures, and calibration products accurate to standards not achieved in wide-field survey instruments to date. These Data Releases grow in size to tens of petabytes over the survey period. The expected data access patterns drive the design of the database and data access services. Finally, the DMS permits interactive analysis and provides nightly summary statistics describing DMS output quality and performance.

April 25

Philip Guo, Stanford


CDE: A tool for creating portable experimental software packages

Nirav Merchant

One technical barrier to reproducible computational science is that it is hard to distribute scientific code in a form that other researchers can easily execute on their own computers.  Before your colleagues can run your experiments, they must first obtain, install, and configure compatible versions of the appropriate software and their myriad of dependencies.  To eliminate this technical barrier, I have created a tool called CDE that automatically packages up all of the software dependencies required to re-run your computational experiments on another computer.  CDE is easy to use: All you need to do is execute the commands for your experiment under its supervision, and CDE packages up all of the Code, Data, and Environment that your commands accessed.  When you send that self-contained package to your colleagues, they can re-run those exact commands on their computers without first installing or configuring anything. CDE is free and open source, available at


May 9

Wes Turner


Open Source Software for Scientific Data Analysis and Presentation

Eric Lyons


Scientific software development often consists of searching for a useful algorithm and then spending additional weeks decoding the text of a paper to actually reach a usable piece of code.  However, much of what is produced is derivative and can be thought of as commodity.  Kitware, Inc. is a company founded to give away commodity software allowing scientists and Kitware developers to focus on the science of generating new answers instead of implementing old solutions.  In this talk I will present an open source toolkit for scientific data visualization and exploration along with some of the open source applications that Kitware has derived.  All of the code I will present is freely available with liberal, Apache 2 licensing terms.

May 23

Amazon AWS


TBD->Large scale database infrastructure: DynamoDB

Nirav Merchant

