
Team Frankenstein's solution sets the stage for a completely offline solution, assuming the students and master machine already have Docker installed. This is a route we have not explored, but we wanted to make the user aware that it is an option.

Description of Project 





Code Availability 

Github Repository - https://github.com/GEABlast/frankenstein

Dockerhub Repo - https://hub.docker.com/r/jforstedt/sswq1

Full Report on Wiki - https://wiki.cyverse.org/wiki/display/A2/Midterm+Deliverable


Installing and Running Instructions 

Before explaining the instructions, it must be noted that as of right now we have only one downloadable file for both worker and master. This is because we downloaded the test DB under SequenceServer. This does not impede the project, since master and worker download the same file, but in a perfect world the worker container would not include a SequenceServer download and would keep the test DB outside of SequenceServer. From here on out, we assume this has already been done. Please keep in mind that the installation process does not change for either file, and the master download instructions should be followed on both master and worker.


To begin this process, the teacher creates a CyVerse master instance; a 'tiny1' size or larger works, and a 'tiny1' instance suffices for this solution. Once the instance is up and running, the teacher downloads the master docker file using the 'wget' command. The teacher can then look up the instance's IPv4 address on the CyVerse launch page and share it with students via email, verbal communication, etc. The master docker file downloads SequenceServer, a test BLAST DB, and the CCTools 7.0.19 suite. It operates over SequenceServer's default port, 4567, which must also be shared with the students. A sketch of these commands follows.
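As a rough sketch of the master setup, the commands look something like the following. The file name, image tag, and download URL below are placeholders we invented for illustration, not the actual names from our repo; substitute whatever the repo's README specifies.

```
# On the CyVerse master instance. Names below are assumptions -- use the
# actual master dockerfile from https://github.com/GEABlast/frankenstein.
wget https://raw.githubusercontent.com/GEABlast/frankenstein/master/Dockerfile.master

# Build the master image (bundles SequenceServer, the test BLAST DB,
# and the CCTools 7.0.19 suite)
docker build -t frankenstein-master -f Dockerfile.master .

# Run it, publishing SequenceServer's default port 4567 for the class
docker run -d -p 4567:4567 frankenstein-master
```

Students can then reach the SequenceServer web interface at http://<master-ip>:4567, using the IPv4 address the teacher shared.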


On the student side, the process is similar: students download the docker file, which includes CCTools 7.0.19 and the test DB. This is done on the local machine, and the professor must help students configure their machines to talk to the correct IP address and port (see the sketch below). Once the whole class is hooked up to the same master instance, students can submit BLAST queries through the master's SequenceServer, whose queue the jobs are filed through. Submitted jobs are then distributed across the classroom, using the classroom's local resources to do the work.
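A minimal sketch of the worker side, again with placeholder file and image names. CCTools' work_queue_worker takes the master's host and port; note that 4567 above is SequenceServer's web port, while the worker connects to whatever port the master's Work Queue process listens on (9123 is the Work Queue default) -- use the values the professor actually shares.

```
# On each student machine. Names are assumptions, not our actual worker files.
wget https://raw.githubusercontent.com/GEABlast/frankenstein/master/Dockerfile.worker
docker build -t frankenstein-worker -f Dockerfile.worker .

# Start the container and attach a CCTools worker to the master;
# <master-ip> and the port come from the professor.
docker run -it frankenstein-worker \
    work_queue_worker <master-ip> 9123
```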


This essentially allows for a scalable solution that is as efficient as the resources on hand allow. There are admittedly shortcomings to our solution that have potential for improvement. One is that benchmarking times in a live use case would vary, because worker cluster efficiency depends on the computational resources in the classroom: the more students acting as workers, the quicker the times, and vice versa. Another is that this solution is meant for small databases; because we use students' local machines, syncing large databases across the local wifi network may cause networking issues or failures. The last caveat is that we assume the students and professor can follow our directions to input the IP address and port; in other words, this is not a download-and-done solution, and some user input is required.


We have also included a tutorial for creating and uploading your own custom database, as seen in Reetu & Friends' solution, should the instructor trust the classroom bandwidth to handle it. It can be seen on the wiki here. That solution uses iRODS, which is what CyVerse uses to move data to and from its data storage; our tutorial for downloading iRODS can also be followed on the wiki here. (INSERT LINK) A sketch of the overall flow appears below.
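As an illustrative sketch of that custom-database flow (the file and collection names are placeholders, not the tutorial's actual names):

```
# Build a BLAST database from a FASTA file of class sequences
makeblastdb -in class_sequences.fasta -dbtype nucl -out class_db

# Authenticate against the CyVerse iRODS zone, then upload the
# generated database files into a collection
iinit
imkdir class_blast_db
iput class_db.* class_blast_db/
```

Workers would then pull the database down from iRODS (e.g. with iget) instead of using the bundled test DB; consult the wiki tutorial for the exact steps.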


Team Members Contribution


Josh Forstedt – Linked independent master and worker VMs running docker containers together, running BLASTEasy’s modified 1.0.12 Sequence Server with WorkQueue. 


Derek Caldwell - Kept track of documentation, worked on the final presentation and in-class progress presentations, and attended class collaboration meetings.


Nasser Albalawi - Created VMs to test benchmarking and created the master machine, created the docker image for the master, worked hands-on with Makeflow and Work Queue, attended class meetings and workshops, and contributed to the writing and presentation.


Asiedu Owusu-Kyereko - Contributed to making the worker docker container, contributed to the write-up and presentation, and attended collaboration meetings.

Jaeden Garcia - Created the database tutorial and attended class collaboration meetings.

Special Thanks to John Xu, Sateesh Peri, and Team BLASTEasy.


Project Timeline 



Benchmarking 

For the following benchmarking portion, we had the entire team of five people run BLAST queries through our solution. Derek could not get his instance up and running in time, so he was considered the 'outlier', submitting jobs as a non-worker; this is why there is no core count next to his name in the results below. Nonetheless, Josh was able to start up two Ubuntu instances (4 cores each) to satisfy the fifth worker requirement. We anticipate performance similar to BLASTeasy's solution, considering the overall architecture is similar. We recognize that since our solution relies on the classroom's collective computing resources, benchmarking times in a live use case would be inconsistent: the more computers connected, the quicker the times. The other obvious difference between our solution and BLASTeasy's is that ours is containerized.

The containerization of the classroom cluster implementation may or may not add time to the benchmarking process. The following results come from our initial benchmarking of this solution with both master and workers as VMs, and provide a ballpark for how long these queries may take.

For benchmarking, we ran the test on four different virtual machines with four different core counts (1, 2, 4, and 8 cores); each machine also has a different amount of memory and disk. The instance sizes of the four virtual machines are as follows:

 

| Instance Name | Instance Size |
| --- | --- |
| Ubuntu 18_04 GUI XFCE Base | Tiny 2 (CPU: 1, Mem: 8 GB, Disk: 60 GB) |
| Ubuntu 18_04 GUI XFCE Base | Small 2 (CPU: 2, Mem: 16 GB, Disk: 120 GB) |
| Ubuntu 18_04 GUI XFCE Base | Medium 3 (CPU: 4, Mem: 32 GB, Disk: 240 GB) |
| Ubuntu 18_04 GUI XFCE Base | Large 3 (CPU: 8, Mem: 64 GB, Disk: 480 GB) |

 


For the first nucleotide benchmark, we used the random generator to create sequences from 1,000 to 500,000 bases long. What we found at the beginning was expected: the smaller the query, the quicker the time. We were unable to break the solution or crash the browser; everything ran fine for the nucleotide sequences. This was not the case with the protein sequences.

 

| Sequence Length (DNA) | 4 core (Nasser) | 4 core (Josh) | 8 core (Jaeden) | 8 core (Ace) | Derek (non-worker) |
| --- | --- | --- | --- | --- | --- |
| 1000 | 0.55 | 0.4 | 1 | 2 | 0.19 |
| 2000 | 0.28 | 0.67 | 1.8 | 1.87 | 0.14 |
| 5000 | 2.49 | 1.32 | 2.1 | 2.5 | 0.17 |
| 15000 | 2.44 | 2.9 | 3.2 | 3.5 | 2.9 |
| 50000 | 7.04 | 6.41 | 7.5 | 7 | 7.6 |
| 100000 | 15.49 | 15.57 | 15.5 | 15.5 | 15.59 |
| 500000 | 56.1 | 56.43 | 56.6 | 57 | 56.8 |

Nucleotide query times (seconds).

 

To benchmark the protein sequences, we used an online random protein sequence generator that only allowed 1 to 100 sequences. In this test, we saw increasingly divergent times depending on the device: machines with more RAM had quicker times, so we infer from the benchmarking process that something is running or processing in local memory. The following table illustrates our results.

 

| Number of Sequences (Protein) | 4 core (Nasser) | 4 core (Josh) | 8 core (Jaeden) | 8 core (Ace) | Derek (non-worker) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.65 | 3.02 | 2.71 | 3.18 | 8.37 |
| 10 | 17.49 | 17.47 | 16.82 | 21 | 53.35 |
| 50 | 01:28.8 | 01:29.4 | 1:30 | 2:05 | 3:22 |
| 100 | 03:01.0 | 03:03.4 | 3:05 | 05:25.3 | 7:48 |

Protein query times (seconds, or min:sec for longer runs).

 

Presentation:

Exported PowerPoint presentation: https://docs.google.com/presentation/d/1SOUsKjVtZrnL7GM0E0m_KDS_d3FSJZITMnHaoGh_rFs/edit?pli=1#slide=id.g73ce69d719_2_5

Post-Mortem Analysis

The Good: As far as things that went right throughout this project, the team learned about some interesting tools and how they can theoretically be applied. We tried our best to help each other out when there were questions, and the support from other teams and classmates was crucial to our success.

What Could Have Been Done Better: I think the team's collective lack of experience held us back; without Josh, we really did not have any solid command-line experience. Adding one more grad student to our team might have improved our chances of finishing in a more timely fashion. We could have met more, either online or in person, to do the homework and get more out of the class. We also could have communicated more frequently throughout the project instead of only during the week of the due date.