Anjul Bhambhri, VP for Big Data, IBM
Big Data and Better Business Outcomes
Organizations today want to tap into the wealth of information hidden in the data around them to improve competitiveness, efficiency and profitability. Huge volumes of data are created every day from a variety of sources, including sensors, smart devices, social media and billions of Internet and smartphone users worldwide. The challenge is storing, managing and deriving just-in-time insights from this data, while preserving and using existing information management investments. The Big Data challenge is pervasive across the majority of industries, including finance, government, telecommunications, retail, healthcare, energy and utilities.
Happy Science Coding
Zack Booth Simpson, Fellow of the Institute for Cellular and Molecular Biology, UT Austin, will demonstrate a new online development platform aimed at scientists called "Happy Science Coding". The web-based platform permits social coding with online documents editable from the browser, simple automatic version control, and an open, server-based execution model that allows any computer in the cloud to serve as an execution machine. A model for maintaining development state will be proposed that would enable scientists to mark their code and dependency state for publication, ensuring that others would be able to run the code in perpetuity.
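To make the idea of marking dependency state concrete, here is a minimal Python sketch of one way such a snapshot could be recorded; the manifest format is purely illustrative and is not the model proposed in the talk.

    # Hypothetical sketch: record the interpreter version and every installed
    # package version into a JSON manifest that could be published alongside
    # code. The format is illustrative only, not the talk's actual model.
    import json
    import sys
    from importlib import metadata

    def snapshot_dependency_state(path="dependency-state.json"):
        state = {
            "python": sys.version.split()[0],
            "packages": {
                dist.metadata["Name"]: dist.version
                for dist in metadata.distributions()
            },
        }
        with open(path, "w") as f:
            json.dump(state, f, indent=2, sort_keys=True)
        return state

    if __name__ == "__main__":
        snapshot_dependency_state()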
Michael Schatz, CSHL
Entering the era of mega-genomics
The continuing revolution in DNA sequencing and biological sensor technologies is driving a digital transformation of the approaches to observation, experimentation, and interpretation that form the foundation of modern biology and genomics. Whereas classical experiments were limited to thousands of hand-collected observations, today’s improved sensors allow billions of digital observations and are improving at an exponential rate that exceeds Moore’s law. These improvements have made it possible to sequence new genomes and monitor the dynamics of biological processes on an unprecedented “mega-scale”, but have brought proportionally greater quantitative and computational requirements.
SciDB: Large Scale Array Data Management
In this talk you will learn about the SciDB big data storage and analytic platform. In contrast to big data approaches associated with Map/Reduce technologies, which were inspired by the requirements of web log analysis, SciDB builds on ideas and methods developed over many years to cope with the challenges of large scale scientific data analysis. SciDB is a transactional DBMS that provides its users with a declarative query language built on top of an array data model. Building on its extensibility and MPP foundations, SciDB supports a wide range of statistical and data processing functionality in a similar fashion to ScaLAPACK. SciDB is a completely new implementation that takes advantage of the intrinsic ordering of array data to deliver superior scalability and performance over complex, real-world workloads.
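To illustrate what the array data model buys you, here is a small Python/NumPy sketch, not SciDB's actual API or query language, of a chunked two-dimensional array with a declarative-style filter that visits chunks in array order:

    # Illustrative sketch of the array data model: a 2-D array stored as
    # regular chunks, with a filter evaluated chunk-by-chunk so that the
    # intrinsic ordering (row/column indices) is preserved. This mimics the
    # idea behind an array DBMS; it is not SciDB's API or query language.
    import numpy as np

    CHUNK = 4  # chunk edge length

    def chunks(a):
        """Yield (row_offset, col_offset, chunk) over a 2-D array."""
        for i in range(0, a.shape[0], CHUNK):
            for j in range(0, a.shape[1], CHUNK):
                yield i, j, a[i:i + CHUNK, j:j + CHUNK]

    def filter_cells(a, predicate):
        """Declarative-style filter: return (i, j, value) for matching
        cells, scanning chunks in array order rather than unordered rows."""
        out = []
        for i0, j0, c in chunks(a):
            ii, jj = np.nonzero(predicate(c))
            out.extend(zip(ii + i0, jj + j0, c[ii, jj]))
        return out

    a = np.arange(64, dtype=float).reshape(8, 8)
    hot = filter_cells(a, lambda c: c > 60.0)  # cells with value > 60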
Jeff Kantor, LSST
An Overview of the LSST Data Management System
The LSST Data Management System (DMS) processes the incoming stream of images that the camera system generates, both to produce transient alerts and to archive the raw images. It periodically creates new calibration data products that other processing functions will use, and it creates and archives an annual Data Release: a static, self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for that Data Release. Finally, the DMS makes all LSST data available through an interface that uses community-based standards and facilitates user data analysis and the production of user-defined data products with supercomputing-scale resources.
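As a rough illustration of these product streams, the following Python sketch mirrors the prose above; every name in it is hypothetical and does not correspond to the actual LSST pipeline code.

    # Hypothetical sketch of the DMS product streams; all names here are
    # illustrative, not part of LSST's actual software.
    from dataclasses import dataclass

    @dataclass
    class RawImage:
        visit_id: int
        pixels: bytes

    def detect_transients(image, calibrations):
        # Placeholder: real difference imaging against templates goes here.
        return []

    def reprocess(image):
        # Placeholder for the full self-consistent Data Release reduction.
        return image

    def nightly_processing(stream, archive, alert_sink, calibrations):
        # Archive each raw image and emit transient alerts as it arrives.
        for image in stream:
            archive.append(image)
            alert_sink.extend(detect_transients(image, calibrations))

    def build_data_release(archive, cutoff_visit):
        # Annual Data Release: reprocess all survey data up to the cutoff.
        return [reprocess(img) for img in archive
                if img.visit_id <= cutoff_visit]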
Philip Guo, Stanford
CDE: A tool for creating portable experimental software packages
One technical barrier to reproducible computational science is that it is hard to distribute scientific code in a form that other researchers can easily execute on their own computers. Before your colleagues can run your experiments, they must first obtain, install, and configure compatible versions of the appropriate software and its myriad dependencies. To eliminate this technical barrier, I have created a tool called CDE that automatically packages up all of the software dependencies required to re-run your computational experiments on another computer. CDE is easy to use: all you need to do is execute the commands for your experiment under its supervision, and CDE packages up all of the Code, Data, and Environment that your commands accessed. When you send that self-contained package to your colleagues, they can re-run those exact commands on their computers without first installing or configuring anything. CDE is free and open source, available at http://www.pgbovine.net/cde.html
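The core mechanism can be approximated in a few lines: run the command under syscall tracing, collect the files it opens, and copy them into a package that mirrors their absolute paths. The sketch below is not CDE's implementation (CDE monitors syscalls itself via ptrace); it leans on the external strace tool purely for illustration.

    # Simplified sketch of CDE's core idea: trace a command's file accesses
    # and copy every opened file into a package directory that mirrors the
    # original absolute-path layout. NOT CDE itself; Linux-only, and it
    # relies on strace instead of CDE's own ptrace-based monitor.
    import re
    import shutil
    import subprocess
    import sys
    from pathlib import Path

    def package(cmd, pkg_dir="cde-package-sketch"):
        # Trace open/openat calls made by the command and its children.
        trace = subprocess.run(
            ["strace", "-f", "-e", "trace=open,openat"] + cmd,
            stderr=subprocess.PIPE, text=True,
        ).stderr
        root = Path(pkg_dir)
        for m in re.finditer(r'open(?:at)?\((?:[^,]+, )?"([^"]+)"', trace):
            src = Path(m.group(1)).resolve()
            if src.is_file():
                dst = root / src.relative_to("/")  # mirror absolute paths
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)

    if __name__ == "__main__":
        package(sys.argv[1:])  # e.g. pass: python experiment.py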
Open Source Software for Scientific Data Analysis and Presentation
Scientific software development often consists of searching for a useful algorithm and then spending additional weeks decoding the text of a paper to actually reach a usable piece of code. However, much of what is produced is derivative and can be thought of as a commodity. Kitware, Inc. is a company founded to give away commodity software, allowing scientists and Kitware developers to focus on the science of generating new answers instead of implementing old solutions. In this talk I will present an open source toolkit for scientific data visualization and exploration, along with some of the open source applications that Kitware has derived from it. All of the code I will present is freely available under liberal Apache 2 licensing terms.
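Assuming the toolkit in question is VTK (the Visualization Toolkit, Kitware's flagship open source project), a minimal example of its Python bindings is the canonical rendered cone:

    # Minimal VTK example, assuming the toolkit discussed is Kitware's VTK:
    # render an interactive cone, the standard VTK "hello world".
    import vtk

    cone = vtk.vtkConeSource()        # procedural geometry source
    cone.SetResolution(32)

    mapper = vtk.vtkPolyDataMapper()  # maps geometry to graphics primitives
    mapper.SetInputConnection(cone.GetOutputPort())

    actor = vtk.vtkActor()            # places the geometry in the scene
    actor.SetMapper(mapper)

    renderer = vtk.vtkRenderer()
    renderer.AddActor(actor)

    window = vtk.vtkRenderWindow()
    window.AddRenderer(renderer)

    interactor = vtk.vtkRenderWindowInteractor()
    interactor.SetRenderWindow(window)

    interactor.Initialize()
    window.Render()
    interactor.Start()                # interactive window; close to exit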
TBD->Large scale database infrastructure: DynamoDB