NOTE: This document is in progress. Changes will happen.
Final Project
Presentation (8-10A)
All Components Due: (by Noon)
Learning Objectives
- Working on a Collaborative Team
- Some projects are too big to be done alone. Learn how to work with others to tackle these types of projects effectively.
- Professionalism in Projects
- Creating accurate, useful, and high-quality deliverables in client-centered projects. Learning how to communicate with clients.
- Understanding Data
- Obtaining, cleaning, managing, and using data.
- Scaling Analyses
- Testing and benchmarking workflows.
- Leveraging multiple machines to perform large scale analyses.
- Choosing the right resource for the right problem.
- Code Availability
- Creating well-documented, publicly available codebases that are reusable by others.
Project Details
Summary: Using observational data (obtained from eButterfly & iNaturalist), species distribution model maps will be computed for ~700 butterfly species in North America using a variety of models & parameter sets over a variety of timeframes. These maps will be used to create 1) an interactive system for viewing distribution maps by species and timeframe, and 2) overall biodiversity map(s) of North American butterflies.
The class will divide into 2 teams. Each will complete the project in its entirety. Details are specified in the Teams section.
Four final "deliverables" need to be submitted. Details for each are specified in the Deliverables section.
- Project Report
- Project Code
- Project Data & Results
- Client Presentation
Getting Help:
Clients will be in class on 11/30 to answer questions and have a discussion.
Develop a list of questions to discuss by 11/28.
Further questions for the client will be necessary. Ask these on the wiki (so everyone can read them) using the "Final - Q&A" Subpage.
If you are unsure about what the client wants vs what the class final is asking, please talk to instructors and clients.
Project Goals
- Obtain observational data for butterfly species in North America, check and clean data.
- Data should be obtained from eButterfly and iNaturalist.
- Data needs to be validated and cleaned to ensure high-quality results.
- The final set of cleaned, well-organized data should be made available.
- Using observational data, run a variety of species distribution models (SDMs) to generate North American distribution maps for ~700 butterfly species over multiple timeframes.
- Species:
- The species from iNaturalist are those listed in the file in the GitHub repository ebutterfly-sdm/data/gbif/taxon-ids.txt (https://github.com/jcoliver/ebutterfly-sdm/blob/master/data/gbif/taxon-ids.txt)
- The species from eButterfly are all the species listed in the eb_butterflies.species table
- Timeframes: Monthly and all months combined (so 13 for each species)
- Multiple SDMs will be used, and for some multiple parameter sets need to be tested.
- For each species / time combination, run SDMs using each of three algorithms ("CTA", "GLM", and "RF"). Use the script `run-sdm-algo.R` to dictate the algorithm. Output file names should include the algorithm used.
- For each algorithm, run the SDM with 1, 10, and 50 background replicates. Output file names should include the number of background replicates.
- Analyses need to be tested and benchmarked before scaling to complete analyses.
- Using maps generated by SDMs, create an interactive system for viewing results by species, timeframe, and SDM/parameter set.
- Technologies for this might include: Jupyter Notebooks, qGIS, Leaflet, or others.
- Animations showing annual change in predicted ranges would be useful. Examples include:
eBird occurrence animation: http://ebird.org/content/ebird/occurrence/
Migration animation: https://imgur.com/gallery/ptIARv7
- A single map (for each algorithm & parameter combination) showing species richness (number of species) estimated by SDMs for all species.
- This is effectively a stack of all SDMs. The script `stack-sdms.R` could be used to combine raster files into a single map.
- Examples include:
- Publicly available code, documentation, data and results.
- All code needs to be housed and well documented on GitHub. Someone should be able to easily clone your repository and repeat all steps of the analysis process.
- Final cleaned data and results need to be available, preferably via the CyVerse Data Store.
- Interactive system for viewing results needs to be documented to the point that it is easy for someone to get it up and running.
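The run matrix described in the Project Goals (13 timeframes, three algorithms, three background-replicate counts) can be enumerated up front, which may help when sizing benchmarks and checking that no output is missing. The sketch below is in Python rather than R and is only an illustration: the file-name pattern, the `.tif` extension, the timeframe labels, and the species IDs are assumptions, not part of the provided scripts.

```python
from itertools import product

ALGORITHMS = ("CTA", "GLM", "RF")        # algorithms named in the project goals
BACKGROUND_REPLICATES = (1, 10, 50)      # replicate counts named in the project goals
# 12 monthly timeframes plus one "all months combined" timeframe (labels are assumed)
TIMEFRAMES = [f"{month:02d}" for month in range(1, 13)] + ["all"]


def output_name(species_id, timeframe, algorithm, replicates):
    """Hypothetical naming scheme. The assignment only requires that the
    algorithm and the number of background replicates appear in the name."""
    return f"{species_id}-{timeframe}-{algorithm}-bg{replicates}.tif"


def job_matrix(species_ids):
    """Return an iterator with one (species, timeframe, algorithm, replicates)
    tuple per SDM run."""
    return product(species_ids, TIMEFRAMES, ALGORITHMS, BACKGROUND_REPLICATES)


if __name__ == "__main__":
    species = ["sp001", "sp002"]          # placeholder IDs, not real taxon IDs
    jobs = list(job_matrix(species))
    print(len(jobs))                      # 2 species * 13 * 3 * 3 = 234
    print(output_name(*jobs[0]))          # sp001-01-CTA-bg1.tif
```

Each species contributes 13 x 3 x 3 = 117 runs, so ~700 species comes to roughly 81,900 runs. That figure is a reasonable starting point for deciding how large a test subset to benchmark before scaling up.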
Teams
- The Final Project will be completed by 2 teams.
- Each team needs to complete the project in its entirety. Between-team collaboration is permitted (& encouraged).
- Each team should be subdivided into a number of sub-teams.
- Sub-teams need to be structured such that no team is waiting for another to complete before they can begin.
- People can be on more than one team.
- Subteam Ideas:
- Project Management: Organizing overall project, coordinating presentation, etc. Project management team members should also be active on at least one other team.
- Data Management: Obtaining, cleaning, organizing, and distributing the data.
- Computation: Running the models, incremental benchmarking & scaling, producing final computed datasets.
- Visualization: Creating high quality visualizations of computed data, producing informative visualizations regarding data quantity & quality.
- Documentation: Writing (& checking) documentation (wiki and GitHub), preparing the final presentation.
Data
eButterfly
- PostgreSQL Dump Download: https://de.cyverse.org/dl/d/BA2D5507-1F85-4A75-8F11-5B537E44A2D9/ebutterfly-acic.sql
To create the empty database locally:
sudo -u postgres createdb ebutterfly
To load the data:
sudo psql -h localhost -U postgres -d ebutterfly -f ebutterfly-acic.sql
- For the ebutterfly database, there are two schemas, `eb_butterflies` and `eb_central`, with tables containing the data of interest.
- We'll want analyses for each species listed in `eb_butterflies.species` (i.e. all unique values in the `species_id` column).
- Individual observations will be in `eb_butterflies.observations`.
- We are only interested in:
- Observations with high ID confidence (see the `idconfidence_id` field and corresponding values in `eb_central.idconfidences`)
- Observations of adults (see `lifestage_id` and corresponding values in `eb_central.lifestages`)
- Observations that have been vetted or are pending (see `observationstatus_id` and corresponding values in `eb_central.observationstatuses`)
- Latitude and longitude are not stored in the observations table, but rather in `eb_central.sites`. To retrieve latitude and longitude for a single observation, you will likely need to use `JOIN` in queries, relying on the following keys:
- `checklist_id` in `eb_butterflies.observations` and `eb_central.checklists`
- `site_id` in `eb_central.checklists` and `eb_central.sites`
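The two-step join described above (observations to checklists on `checklist_id`, then checklists to sites on `site_id`) might look like the following. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the PostgreSQL dump so that it runs anywhere. The `observation_id`, `latitude`, and `longitude` column names, the sample rows, and the numeric ID values used for "high confidence", "adult", "vetted", and "pending" are all assumptions; check the real values in the `eb_central` lookup tables before relying on them.

```python
import sqlite3

# Schema-qualified table names are taken from the assignment text; the
# latitude/longitude/observation_id column names are assumed.
QUERY = """
SELECT o.observation_id, s.latitude, s.longitude
FROM eb_butterflies.observations AS o
JOIN eb_central.checklists AS c ON c.checklist_id = o.checklist_id
JOIN eb_central.sites AS s ON s.site_id = c.site_id
WHERE o.species_id = :species_id
  AND o.idconfidence_id = :high_confidence_id
  AND o.lifestage_id = :adult_id
  AND o.observationstatus_id IN (:vetted_id, :pending_id)
ORDER BY o.observation_id
"""


def build_demo_db():
    """Create a tiny SQLite stand-in for the two PostgreSQL schemas."""
    conn = sqlite3.connect(":memory:")
    # Attaching databases under the schema names lets the same
    # schema-qualified table names work in SQLite.
    conn.execute("ATTACH DATABASE ':memory:' AS eb_butterflies")
    conn.execute("ATTACH DATABASE ':memory:' AS eb_central")
    conn.executescript(
        """
        CREATE TABLE eb_central.sites (
            site_id INTEGER PRIMARY KEY, latitude REAL, longitude REAL);
        CREATE TABLE eb_central.checklists (
            checklist_id INTEGER PRIMARY KEY, site_id INTEGER);
        CREATE TABLE eb_butterflies.observations (
            observation_id INTEGER PRIMARY KEY,
            species_id INTEGER,
            checklist_id INTEGER,
            idconfidence_id INTEGER,
            lifestage_id INTEGER,
            observationstatus_id INTEGER);
        -- Made-up rows for illustration only.
        INSERT INTO eb_central.sites VALUES (1, 32.2, -110.9), (2, 45.4, -75.7);
        INSERT INTO eb_central.checklists VALUES (10, 1), (11, 2);
        INSERT INTO eb_butterflies.observations VALUES
            (100, 7, 10, 1, 1, 1),
            (101, 7, 11, 3, 1, 1),
            (102, 8, 10, 1, 1, 2);
        """
    )
    return conn


def fetch_coordinates(conn, species_id, high_confidence_id, adult_id,
                      vetted_id, pending_id):
    """Return (observation_id, latitude, longitude) rows that pass all filters."""
    params = {
        "species_id": species_id,
        "high_confidence_id": high_confidence_id,
        "adult_id": adult_id,
        "vetted_id": vetted_id,
        "pending_id": pending_id,
    }
    return conn.execute(QUERY, params).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    # ID values below are hypothetical placeholders for the lookup-table keys.
    print(fetch_coordinates(demo, species_id=7, high_confidence_id=1,
                            adult_id=1, vetted_id=1, pending_id=2))
```

Against the real PostgreSQL database the SQL text should carry over largely unchanged, but the parameter placeholder syntax depends on the client library used.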
iNaturalist
Deliverables
Note: All deliverables must be fully public, accessible, and use a public license (unless conflicting with another license)
1. Project Report (Wiki | Due Dec. 14th @ Noon)
Project Overview
Team Members (names and roles)
Abstract: Overview of analysis workflow, and results management, visualization, and user interaction.
Input Data
Description of data.
Where/how were they obtained.
What was done to clean them for processing.
Link, description, and instructions for using the code to get and clean the data.
Link to publicly available cleaned data set.
Analysis Workflow/Code:
Description of analysis workflow.
Instructions for use
Number of input datasets
Describe scaling methods
Benchmarks from test data and whole data
How much would this have cost in Amazon Web Services
How was the workflow validated (i.e. how do you know the results are valid)
Results
How are results managed
How are results evaluated for faults
Where can you find the results?
Interactive user interface
What does it take to run the final interface
Detailed description/Tutorial of how to use it
Project plan and timeline
Detailed benchmarks
How long it took to run the full dataset:
Software installation
Data Staging
Data Processing
Workflow monitoring
Visualization of results
Results deposition
Post-Mortem Analysis (SubPage)
Focus on team processes; not technical problems
What worked well
What didn't work well
What you would do differently
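For the benchmark and cost items in the report outline above, one simple approach is to time a small test subset and extrapolate to the full run. The sketch below shows that arithmetic under strong assumptions: every job takes about the same time, work scales perfectly across workers, and billing is per worker-hour. The hourly rate and the per-job time in the example are placeholders, not actual Amazon Web Services prices or measured results.

```python
import time


def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def estimate_full_run(seconds_per_job, n_jobs, n_workers, hourly_rate_usd):
    """Extrapolate wall-clock hours and total cost for the full analysis.

    Assumes uniform job times and perfect parallel scaling, so treat the
    output as a rough lower bound rather than a prediction.
    """
    total_worker_hours = seconds_per_job * n_jobs / 3600
    wall_hours = total_worker_hours / n_workers
    cost_usd = total_worker_hours * hourly_rate_usd
    return wall_hours, cost_usd


if __name__ == "__main__":
    # Placeholder inputs: 60 s per job, ~700 species * 117 runs each,
    # 100 workers, and a made-up rate of $0.10 per worker-hour.
    wall, cost = estimate_full_run(60, 700 * 117, 100, 0.10)
    print(f"{wall:.2f} wall-clock hours, ${cost:.2f}")
```

Reporting the measured per-job time from the test subset alongside the extrapolated figure makes it easier to see how far the full run deviated from the estimate.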
2. Project Code (GitHub | Due Dec. 14th @ Noon)
- Overview
System requirements
Getting Started
How to use
How to scale (what is needed)
Description of output
Warnings and caveats
License
Author description
Where to get additional help
3. Project Data & Results (CyVerse | Due Dec. 12 @ 8AM)
- Project Data (final cleaned observational dataset) and Results (SDM maps) need to be well organized and easily available.
- It is suggested to use the CyVerse datastore to house these data. See Using the Data Store.
- Interactive system needs to be available (or easily set up w/ clear instructions).
- A live demonstration of this is required during the presentation, and each team will need to show the data for a specific species/timeframe/model as requested.
- Clear instructions for someone to use this system need to be provided.
4. Client Presentation (In-Class | Due Dec. 12 @ 8AM)
Introduction
Team members, roles and responsibilities
Overview of project
Overview of results
Details on:
Data: obtaining, cleaning, about the final dataset.
Analysis workflow: overview, obtaining, how to use
Benchmarks: Scaling, full analysis time
Results and result management
Final user interface for interacting with results
Code:
Where is it?
What does it take to rerun it?
Show an example of making a modification to the code and rerunning it
Documentation and training materials
How would a new person take the code and get to work?
How would a person take the code and swap out the SDM for another model, set of models, or set of parameters?
Live demo of final user interface for interacting with results
- We will be asking each team to demonstrate for one of the species. Which one? That's the surprise!
5. Teammate Evaluations
- Form: https://docs.google.com/forms/d/1KCPRipVhHlRrbheXKO9knPqYW-4_0b_Tz5ICH2B0u9k/viewform
- Review each member of your subteam, and your "project management" person(s)
- Feel free to also leave evaluations for any team member, or any member of the other team who you might have worked with/helped/gotten help from!