NOTE: This document is in progress. Changes will happen.

Final Project

Presentation  (8-10A)

All Components Due:   (by Noon)

Learning Objectives

  • Working on a Collaborative Team
    • Some projects are too big to be done alone. Learn how to work with others to tackle these projects effectively.
  • Professionalism in Projects
    • Creating accurate, useful, and high-quality deliverables in client-centered projects. Learning how to communicate with clients.
  • Understanding Data
    • Obtaining, cleaning, managing, and using data.
  • Scaling Analyses
    • Testing and benchmarking workflows.
    • Leveraging multiple machines to perform large scale analyses.
    • Choosing the right resource for the right problem.
  • Code Availability
    • Creating well-documented, publicly available codebases that are reusable by others.

Project Details

Summary: Using observational data (obtained from eButterfly & iNaturalist), species distribution model maps will be computed for ~700 butterfly species in North America using a variety of models & parameter sets over a variety of timeframes. These maps will be used to create 1) an interactive system for viewing distribution maps by species and timeframe, and 2) overall biodiversity map(s) of North American butterflies.

The class will divide into 2 teams. Each will complete the project in its entirety. Details are specified in the Teams section.

Four final "deliverables" need to be submitted. Details for each are specified in the Deliverables section.

  1. Project Report
  2. Project Code
  3. Project Data & Results
  4. Client Presentation

Getting Help:

  • Clients will be in class on 11/30 to answer questions and have a discussion

    • Develop list of questions to discuss by 11/28

  • Further questions for the client will likely come up. Ask them on the wiki (so everyone can read them) using the "Final - Q&A" subpage.

  • If you are unsure about what the client wants vs what the class final is asking, please talk to instructors and clients.

Project Goals

  • Obtain observational data for butterfly species in North America, check and clean data.
    • Data should be obtained from eButterfly
    • Data needs to be validated and cleaned to ensure high-quality results.
    • The final set of cleaned, well-organized data should be made available.
  • Using observational data, run a variety of species distribution models (SDMs) to generate maps of the North American distributions of ~700 butterfly species over multiple timeframes.
    • Species:
    • Timeframes: Monthly and all months combined (so 13 for each species)
    • Multiple SDMs will be used, and for some multiple parameter sets need to be tested.
      • For each species/time combination, run SDMs using each of three algorithms ("CTA", "GLM", and "RF"). Use the script `run-sdm-algo.R` to specify the algorithm. Output file names should include the algorithm used.
      • For each algorithm, run the SDM with 1, 10, and 50 background replicates. Output file names should include the number of background replicates.
    • Analyses need to be tested and benchmarked before scaling to complete analyses.
  • Using maps generated by SDMs, create an interactive system for viewing results by species, timeframe, and SDM/parameter set.
  • Create a single map (for each algorithm & parameter combination) showing species richness (number of species) estimated by SDMs for all species.
  • Publicly available code, documentation, data, and results.
    • All code needs to be housed and well documented on GitHub. Someone should be able to easily clone your repository and repeat all steps of the analysis process.
    • Final cleaned data and results need to be available, preferably via the CyVerse Data Store.
    • Interactive system for viewing results needs to be documented to the point that it is easy for someone to get it up and running.
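
To get a feel for the size of the analysis before benchmarking, the species / timeframe / algorithm / replicate combinations described above can be enumerated up front. The following is a minimal Python sketch, not part of the provided scripts: the timeframe labels, the `.tif` extension, the file-name pattern, and the function names are all hypothetical. Only the three algorithms, the three replicate counts, and the 13 timeframes come from the goals above.

```python
from itertools import product

ALGORITHMS = ("CTA", "GLM", "RF")        # from the project goals
BACKGROUND_REPLICATES = (1, 10, 50)      # from the project goals
# 12 months plus all months combined; the labels themselves are an assumption.
TIMEFRAMES = [f"{month:02d}" for month in range(1, 13)] + ["all"]


def output_name(species_id, timeframe, algorithm, replicates):
    """Hypothetical naming scheme embedding the algorithm and replicate count."""
    return f"{species_id}-{timeframe}-{algorithm}-bg{replicates}.tif"


def enumerate_jobs(species_ids):
    """Yield (species, timeframe, algorithm, replicates, file name) per SDM run."""
    combos = product(species_ids, TIMEFRAMES, ALGORITHMS, BACKGROUND_REPLICATES)
    for species_id, timeframe, algorithm, replicates in combos:
        name = output_name(species_id, timeframe, algorithm, replicates)
        yield species_id, timeframe, algorithm, replicates, name


if __name__ == "__main__":
    # 700 placeholder ids standing in for the real species_id values.
    jobs = list(enumerate_jobs(range(1, 701)))
    print(len(jobs))  # 700 species * 13 timeframes * 3 algorithms * 3 replicate counts
```

With roughly 700 species this comes to about 81,900 individual runs, which is why testing on a small subset before scaling matters.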

Teams

  • The Final Project will be completed by 2 teams.
  • Each team needs to complete the project in its entirety. Between-team collaboration is permitted (& encouraged).
  • Each team should be subdivided into a number of sub-teams.
    • Sub-teams need to be structured such that no sub-team is waiting for another to finish before it can begin.
    • People can be on more than one sub-team.
    • Subteam Ideas:
      1. Project Management: Organizing overall project, coordinating presentation, etc. Project management team members should also be active on at least one other team.
      2. Data Management: Obtaining, cleaning, organizing, and distributing the data.
      3. Computation: Running the models, incremental benchmarking & scaling, producing final computed datasets.
      4. Visualization: Creating high quality visualizations of computed data, producing informative visualizations regarding data quantity & quality.
      5. Documentation: Writing (& checking) documentation (wiki and GitHub), preparing the final presentation.

Data

eButterfly

  • PostgreSQL Dump Download: https://de.cyverse.org/dl/d/BA2D5507-1F85-4A75-8F11-5B537E44A2D9/ebutterfly-acic.sql
  • To create the empty database locally: 

    • sudo -u postgres createdb ebutterfly
  • To load the data:

    • sudo psql -h localhost -U postgres -d ebutterfly -f ebutterfly-acic.sql
  • The ebutterfly database has two schemas, eb_butterflies and eb_central, with tables containing the data of interest.
    • We'll want analyses for each species listed in eb_butterflies.species.
      • (i.e. all unique values in the species_id column).
    • Individual observations will be in eb_butterflies.observations.
      • We are only interested in:
        • Observations with high ID confidence (see idconfidence_id field and corresponding values in eb_central.idconfidences)
        • Observations of adults (see lifestage_id and corresponding values in eb_central.lifestages)
        • Observations that have been vetted or are pending (see observationstatus_id and corresponding values in eb_central.observationstatuses)
    • Latitude and longitude are not stored in the observations table, but rather in eb_central.sites. To retrieve latitude and longitude for a single observation, you will likely need to use JOIN in queries, relying on the following keys:
      • checklist_id in eb_butterflies.observations and eb_central.checklists
      • site_id in eb_central.checklists and eb_central.sites
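
The filters and joins described above might be combined into a single query along the following lines. This is a sketch under stated assumptions, not the reference solution: the `latitude` and `longitude` column names in eb_central.sites are guesses, and the id values for "high confidence", "adult", "vetted", and "pending" must be looked up in the corresponding eb_central tables rather than taken from this example. So that the example runs without a PostgreSQL server, it builds a tiny stand-in database with Python's built-in `sqlite3` module (attached in-memory databases play the role of the two schemas). Against the real dump, the same SQL could be run through a PostgreSQL driver such as psycopg2, which uses `%s` placeholders instead of `?`.

```python
import sqlite3

# Table and key names follow the description above. The latitude/longitude
# column names and the id values passed as parameters are assumptions.
QUERY = """
SELECT o.species_id, s.latitude, s.longitude
FROM eb_butterflies.observations AS o
JOIN eb_central.checklists AS c ON c.checklist_id = o.checklist_id
JOIN eb_central.sites AS s ON s.site_id = c.site_id
WHERE o.idconfidence_id = ?
  AND o.lifestage_id = ?
  AND o.observationstatus_id IN (?, ?)
ORDER BY o.species_id
"""


def fetch_clean_observations(conn, high_confidence_id, adult_id, status_ids):
    """Return (species_id, latitude, longitude) rows passing all three filters."""
    vetted_id, pending_id = status_ids
    params = (high_confidence_id, adult_id, vetted_id, pending_id)
    return conn.execute(QUERY, params).fetchall()


def demo_connection():
    """Build a toy in-memory database that mimics the two schemas."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS eb_butterflies")
    conn.execute("ATTACH DATABASE ':memory:' AS eb_central")
    conn.executescript("""
        CREATE TABLE eb_butterflies.observations (
            species_id INTEGER, checklist_id INTEGER, idconfidence_id INTEGER,
            lifestage_id INTEGER, observationstatus_id INTEGER);
        CREATE TABLE eb_central.checklists (checklist_id INTEGER, site_id INTEGER);
        CREATE TABLE eb_central.sites (site_id INTEGER, latitude REAL, longitude REAL);
        -- Made-up ids: confidence 1 = high, lifestage 1 = adult,
        -- status 1 = vetted, 2 = pending, 3 = rejected.
        INSERT INTO eb_butterflies.observations VALUES
            (10, 100, 1, 1, 1),   -- kept
            (10, 100, 2, 1, 1),   -- dropped: low ID confidence
            (11, 101, 1, 2, 1),   -- dropped: not an adult
            (11, 101, 1, 1, 2),   -- kept: pending
            (12, 101, 1, 1, 3);   -- dropped: rejected
        INSERT INTO eb_central.checklists VALUES (100, 1000), (101, 1001);
        INSERT INTO eb_central.sites VALUES (1000, 32.2, -110.9), (1001, 45.5, -73.6);
    """)
    return conn


if __name__ == "__main__":
    for row in fetch_clean_observations(demo_connection(), 1, 1, (1, 2)):
        print(row)
```

Passing the id values as parameters keeps the query reusable once the real lookup values are known.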

iNaturalist

 

Deliverables

Note: All deliverables must be fully public and accessible, and must use a public license (unless this conflicts with another license).

1. Project Report (Wiki | Due Dec. 14th @ Noon)

  1. Project Overview

  2. Team Members (names and roles)

  3. Abstract: Overview of analysis workflow, and results management, visualization, and user interaction.

  4. Input Data

    • Description of data.

    • Where/how were they obtained.

    • What was done to clean them for processing.

    • Link to, description of, and instructions for using the code to get and clean the data.

    • Link to the publicly available cleaned data set.

  5. Analysis Workflow/Code:

    • Description of analysis workflow.

    • Instructions for use

    • Number of input datasets

    • Describe scaling methods

    • Benchmarks from test data and whole data

      • How much would this have cost in Amazon Web Services

    • How was the workflow validated (i.e., how do you know the results are valid)?

  6. Results

    • How are results managed

    • How are results evaluated for faults

    • Where can you find the results?

  7. Interactive user interface

    • What does it take to run the final interface

    • Detailed description/Tutorial of how to use it

  8. Project plan and timeline

  9. Detailed benchmarks

    • How long it took to run the full dataset:

    • Software installation

    • Data Staging

    • Data Processing

    • Workflow monitoring

    • Visualization of results

    • Results deposition

  10. Post-Mortem Analysis (SubPage)

    • Focus on team processes; not technical problems

    • What worked well

    • What didn't work well

    • What you would do differently
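
For the benchmark and Amazon Web Services cost items in the report outline above, one simple approach is to time a small test subset and extrapolate to the full run. The sketch below shows that arithmetic in Python. It assumes that jobs are independent and take roughly equal time, and that every worker is billed for the full wall-clock duration. The function name and the numbers in the usage example (including the hourly price) are hypothetical placeholders, not measured values or actual AWS rates.

```python
import math


def estimate_full_run(seconds_per_job, n_jobs, n_workers, price_per_worker_hour):
    """Extrapolate a per-job benchmark to wall-clock hours and total cost.

    Assumes equal-length, independent jobs spread evenly over the workers,
    with each worker billed for the whole wall-clock duration.
    """
    jobs_per_worker = math.ceil(n_jobs / n_workers)
    wall_hours = jobs_per_worker * seconds_per_job / 3600
    cost = wall_hours * n_workers * price_per_worker_hour
    return wall_hours, cost


if __name__ == "__main__":
    # Placeholder inputs: 60 s per job, 81,900 jobs, 100 workers, $0.10 per worker-hour.
    hours, dollars = estimate_full_run(60, 81_900, 100, 0.10)
    print(f"{hours:.2f} wall-clock hours, ${dollars:.2f}")
```

Reporting the measured seconds per job alongside the estimate makes it easy for a reader to redo the calculation with current prices.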

2. Project Code (GitHub | Due Dec. 14th @ Noon)

  • Overview
  • System requirements

  • Getting Started

  • How to use

  • How to scale (what is needed)

  • Description of output

  • Warnings and caveats

  • License

  • Author description

  • Where to get additional help

3. Project Data & Results (CyVerse | Due Dec. 12 @ 8AM)

  • Project Data (final cleaned observational dataset) and Results (SDM maps) need to be well organized and easily available.
  • Interactive system needs to be available (or easily set up w/ clear instructions). 
    • A live demonstration of this is required during the presentation, and each team will need to show the data for a specific species/timeframe/model as requested.
    • Clear instructions for someone to use this system need to be provided.

4. Client Presentation (In-Class | Due Dec. 12 @ 8AM)

  • Introduction

  • Team members, roles, and responsibilities

  • Overview of project

  • Overview of results

  • Details on:

    • Data: obtaining, cleaning, about the final dataset.

    • Analysis workflow: overview, obtaining, how to use

    • Benchmarks: Scaling, full analysis time

    • Results and result management

    • Final user interface for interacting with results

  • Code:

    • Where is it?

    • What does it take to rerun it?

    • Show an example of modifying the code and rerunning it

  • Documentation and training materials

    • How would a new person take the code and get to work?

    • How would a person take the code and swap out the SDM for another model, set of models, or set of parameters?

  • Live demo of final user interface for interacting with results

    • We will be asking each team to demonstrate for one of the species. Which one? That's the surprise!

5. Teammate Evaluations

 

 
