- Naomi Yescas
- Sahil Brahmankar
- Tanner Campbell
- Sateesh Peri
GOAL: Create GEA BLAST service to support genomics training and research for undergraduate students
+ Develop a system that divides and distributes BLAST searches across multiple nodes and processors to obtain results faster. [critical]
+ Host BLAST implementations that support multiple classes of at least 100 students. [critical]
+ Ensure that ~100 concurrent jobs finish at approximately the same time. [critical]
+ Support faculty in creating custom BLAST databases and adjusting BLAST search parameters [nice-to-have]
+ Support caching of BLAST results [nice-to-have]
+ Provide authentication for security [nice-to-have]
+ Video tutorial demonstrating the end product [nice-to-have]
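One way to approach the first goal (dividing a BLAST search so the pieces can run on separate nodes or processors) is to split a multi-sequence FASTA query into chunks. The sketch below is only an illustration under that assumption; the function names `parse_fasta` and `split_fasta` are hypothetical and not part of any existing tool.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    seq_lines: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_lines)
            header = line[1:]
            seq_lines = []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def split_fasta(text: str, n_chunks: int) -> List[str]:
    """Distribute records round-robin into at most n_chunks FASTA strings."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(n_chunks)]
    for i, (header, seq) in enumerate(parse_fasta(text)):
        buckets[i % n_chunks].append(f">{header}\n{seq}\n")
    # Drop empty buckets when there are fewer records than chunks.
    return ["".join(bucket) for bucket in buckets if bucket]
```

Each returned chunk could then be submitted as an independent search against the full database. Splitting the database instead would be a different design and is not shown here.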
Plan-A (Web-based client-server model)
+ Technology requirements:
+ Web browser
+ Data Store
+ Ease of installation (time and effort)
+ Robustness and fault tolerance
+ Support for multi-node execution
+ Three tier Architecture (Each tier can scale horizontally.)
+ Because the Presentation tier can cache requests, network utilization is minimized.
+ NCBI updates its BLAST databases regularly, and keeping local copies up to date requires advanced technical skills
+ Potential unknowns & problems
+ Will network speeds between user and HPC cluster affect performance?
+ Accessibility from multiple platforms
+ Size of custom BLAST databases
+ Range of BLAST query sizes
+ Exact number of users concurrently submitting jobs
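The caching idea mentioned for the presentation tier (and listed as a nice-to-have goal) could be sketched as a lookup keyed on everything that affects a search result. This is a minimal in-memory illustration, not the project's design: `cache_key` and `ResultCache` are hypothetical names, and the assumption that the database label carries a version (so results are not reused after a database update) is ours.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(program: str, database: str, query: str, params: Dict[str, str]) -> str:
    """Build a deterministic key from the inputs that determine the result."""
    # Normalize only whitespace; sequence case is kept as submitted.
    lines = [ln.strip() for ln in query.splitlines() if ln.strip()]
    payload = json.dumps(
        {
            "program": program,
            "database": database,  # assumed to include a version label
            "query": "\n".join(lines),
            "params": params,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory cache; a real deployment would likely use a persistent data store."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_run(
        self,
        program: str,
        database: str,
        query: str,
        params: Dict[str, str],
        run_search: Callable[[], str],
    ) -> str:
        key = cache_key(program, database, query, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = run_search()  # only executed on a cache miss
        self._store[key] = result
        return result
```

In a classroom where many students submit the same exercise query, only the first submission would trigger an actual search under this scheme.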
Plan-B (Desktop-based client-server model)
+ The client can share resources (HPC free option)
+ The client only has to fetch data because the interface is already installed (hence faster)
+ The application must be installed on each client's computer
+ Performance depends on the client's computing power
+ Updating interface
+ Less robust to database downtime
+ Potential unknowns & problems
+ The feasibility of developing a user interface
Questions for the client
+ Should the interface expose only the necessary parameters and handle most of the application logic and compute distribution on its own?
+ Is logging in terms of recording usage and resource-specific errors critical to the project?
+ Under Agile principles, we need to know when it's "good enough": would ~20 students at 2-3 minutes at almost the same time, or 10 students at 5 minutes at exactly the same time, be good enough?
+ Will the class be using NCBI public databases or custom databases?
+ If custom, what is the size of the files that would go into making the BLAST databases?
+ What is the technology available to faculty & students in the classroom?
Development process (e.g., Agile)
Concept of Operations (ConOps) -> Functional Requirements -> Functional Block Diagram -> Project Proposal -> System Requirements -> System Architecture & Sub Modules -> Minimum Viable Product -> Stretch Goals
Come up with a set of questions to focus down on minimum viable product and verify functional requirements
The functional requirements will define the scope of the project
They will also be what's inside our functional block diagram
Shall statements - what we have to do / Can be updated when we learn more about our system requirements
- The team shall containerize the sequence server and BLAST database or modify existing containers and use a workflow manager to assign tasks consisting of genomic sequence searches over a given database or the public BLAST database to a high-performance computer.
- In order to validate the system performance, the team shall stress the system to the equivalent of 100 sequencing requests and the client will attempt to conduct a typical use case.
- To validate and quantify system effectiveness, performance measures such as time to complete a search under varying circumstances shall be documented.
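The stress test and performance measures described in the shall statements could be collected with a small harness like the one below. It is a sketch under stated assumptions: `stress_test` and `run_job` are hypothetical names, and `run_job` stands in for whatever actually submits one search (for example, an HTTP request to the web interface). Threads are used only to launch jobs concurrently from one machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List


def stress_test(
    run_job: Callable[[int], None],
    n_jobs: int = 100,
    max_workers: int = 100,
) -> Dict[str, float]:
    """Launch n_jobs concurrently and report when each one finished."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    start = time.perf_counter()

    def timed(job_id: int) -> float:
        run_job(job_id)
        # Seconds elapsed since the whole batch was started.
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        finish_times: List[float] = list(pool.map(timed, range(n_jobs)))

    return {
        "jobs": float(n_jobs),
        "first_finish_s": min(finish_times),
        "last_finish_s": max(finish_times),
        "mean_finish_s": mean(finish_times),
        # A small spread means jobs finished at about the same time.
        "spread_s": max(finish_times) - min(finish_times),
    }
```

The `spread_s` value speaks to the goal that all jobs finish at approximately the same time, while `last_finish_s` gives the time a class would wait for the slowest search.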
We can ask whether this is an MVP or a conservative approximation. With Agile, we need to know when it's "good enough": would ~20 students at 2-3 minutes at almost the same time, or 10 students at 5 minutes at exactly the same time, be good enough? What are the priorities? We could get this done first and then try to shoot for 100 later. With this project, it doesn't seem like we have a lot of time, so the sooner we get even the slightest improvement, the better.
Concept of Operations:
Genomics Education Partnership, Wilson Leung
As a biology faculty member, I want to have my students run BLAST searches as part of a homework exercise during my class so that they can learn about the different BLAST programs and databases.
I want to run a BLAST search of a protein from an informant genome against the genome assemblies of multiple species so that I can investigate the evolution of this protein.
- Introductory biology lab courses typically have more than 20 students
- Lab courses at different institutions are often scheduled at the same time
- Popular time: Tuesday and Thursday afternoons
- Support multiple classes using the BLAST curriculum concurrently: 100 students
Interfaces: Cyverse, Sequence Server
Auxiliary Equipment: HPC, Docker Containers, Kubernetes, Makeflow
Geographical and physical locations: Cyverse, Public BLAST database, local database
Operating Environment: Classrooms at multiple locations with ~20 students each, ~100 students total at any given time
9/19/2019: Functional Block Diagram/System Requirements Version-1
10/8/2019: Project Review / Revisit with client & verify functional requirements
10/15/2019: Midterm Presentations
- Improved Sequence Server stack capable of handling at least 100 sequencing requests
- Performance measures to verify improvements
- sequenceServer, available as a Docker container, handles the actual BLAST searches and presents results in a user-friendly UI
- Given that, the project's main goal is to support parallel sequence searches using the same standalone interface
- Identify and divide the project architecture into components so that the client can install and configure the web portal as a separate component on top of an HPC cluster without drastically changing the software or hardware configuration.
- Web interface component
- Data Store component
- Software component
- HPC component
- In the sequenceServer code, identify and intercept the point where the actual blastn or blastp commands are queued for execution, and convert them to a Makeflow script for distributed computing
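The conversion step in the last item could look roughly like the sketch below, which writes one Makeflow rule per query chunk plus a final rule that merges the partial outputs. It is an illustration only: `blast_to_makeflow` is a hypothetical helper, not part of sequenceServer, and it assumes tabular output (`-outfmt 6`) so that partial results can simply be concatenated, that the BLAST program and database are already available on every worker node, and that file names contain no whitespace.

```python
from typing import List


def blast_to_makeflow(
    program: str,
    database: str,
    chunk_files: List[str],
    merged_output: str,
    extra_args: str = "-outfmt 6",
) -> str:
    """Return Makeflow text: one search rule per chunk, then a merge rule."""
    for name in chunk_files + [merged_output]:
        if any(ch.isspace() for ch in name):
            raise ValueError(f"file name contains whitespace: {name!r}")

    rules: List[str] = []
    partial_outputs: List[str] = []
    for chunk in chunk_files:
        out = f"{chunk}.out"
        partial_outputs.append(out)
        command = f"{program} -query {chunk} -db {database} {extra_args} -out {out}"
        # Make-style rule: "target: source" followed by a tab-indented command.
        rules.append(f"{out}: {chunk}\n\t{command}\n")

    merge_inputs = " ".join(partial_outputs)
    # The merge rule depends on every partial output, so it runs last.
    rules.append(f"{merged_output}: {merge_inputs}\n\tcat {merge_inputs} > {merged_output}\n")
    return "\n".join(rules)
```

For example, `blast_to_makeflow("blastn", "nt", ["q0.fa", "q1.fa"], "all.tsv")` yields two independent search rules that a workflow manager can schedule in parallel, followed by the merge rule.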