This is an early beta release, and users are encouraged to provide feedback.
Rationale and background:
Understanding how your storage space is being used is a key step in managing data.
This app builds a summary database of files stored in iRODS collections (such as the CyVerse data store), Amazon S3 buckets, or directories on your device, and allows you to search, sort, and compare them. It provides information about file sizes, types, and duplicated files. The application crawls through all the subdirectories of the chosen location and summarizes the data at both the individual file level and the directory level, including identifying duplicate files (by content). It contacts the data store only once to import details about the files and directories; all subsequent searches use the local database, which makes them much faster.
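The import-once, query-locally approach described above can be sketched with Python's standard library. This is only an illustration under stated assumptions, not DataHog's actual implementation: the `files` table, its columns, and the function names `build_index` and `size_by_extension` are hypothetical, and the sketch crawls a local directory rather than an iRODS collection or S3 bucket.

```python
import os
import sqlite3


def build_index(root, db_path=":memory:"):
    """Crawl `root` once and store one row per file in a local SQLite database.

    The table layout is a hypothetical example, not DataHog's schema.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, directory TEXT, name TEXT, "
        "extension TEXT, size INTEGER, modified REAL)"
    )
    rows = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # skip files that disappear or cannot be read
            extension = os.path.splitext(name)[1].lower()
            rows.append(
                (path, directory, name, extension, info.st_size, info.st_mtime)
            )
    conn.executemany(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


def size_by_extension(conn):
    """Return (extension, file count, total bytes) rows, largest total first."""
    return conn.execute(
        "SELECT extension, COUNT(*), SUM(size) FROM files "
        "GROUP BY extension ORDER BY SUM(size) DESC"
    ).fetchall()
```

After `build_index` has run once, any number of summaries or searches can be answered from the local database without touching the original storage location again.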
The user interface includes 4 tabs:
- Summary provides statistics of file types (by file extension), sizes, and creation dates.
- Duplicate files shows the duplicate files and their locations.
- Browse files allows you to browse the results of the crawl (like a file manager) and search for files using name, regex, date, and size filters.
- Import File Data allows you to query the data store for file information in a specific directory, replacing the last directory scanned.
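The name, regex, date, and size filters mentioned for file browsing can be illustrated in plain Python. The `FileRecord` type and the `search` function are hypothetical names used only for this sketch, which assumes the file details have already been imported into memory; it is not how the app itself is implemented.

```python
import posixpath
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileRecord:
    """Hypothetical record holding the details imported for one file."""
    path: str
    size: int          # size in bytes
    created: datetime  # creation date


def search(records, name_contains=None, pattern=None,
           min_size=None, max_size=None,
           created_after=None, created_before=None):
    """Return the records that satisfy every filter that was supplied."""
    regex = re.compile(pattern) if pattern else None
    matches = []
    for record in records:
        name = posixpath.basename(record.path)
        if name_contains and name_contains.lower() not in name.lower():
            continue  # case-insensitive substring match on the file name
        if regex and not regex.search(name):
            continue
        if min_size is not None and record.size < min_size:
            continue
        if max_size is not None and record.size > max_size:
            continue
        if created_after is not None and record.created < created_after:
            continue
        if created_before is not None and record.created > created_before:
            continue
        matches.append(record)
    return matches
```

For example, `search(records, pattern=r"\.fastq$", min_size=1_000_000)` would keep only FASTQ files of at least one million bytes.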
Launch the application, click the VICE link provided by DE, select the "Import File Data" tab, and provide your information. By default, the form should be prefilled with your CyVerse username and home directory (see the example below).
The launch page offers five options for importing file data into DataHog:
- iRODS: Use the iRODS API to import data from a specific collection. The options for importing files from the CyVerse data store are prefilled.
- .datahog File: Upload a .datahog file containing file data. These can be generated by a Python script which you can download and run on any machine.
- CyVerse: Use the CyVerse file search API to import any data stored in the data store. This method currently does not support exact duplicate matching, and may be slower than iRODS in some cases.
- S3 Bucket: Use your AWS access keys to import an S3 bucket, or a specific directory from one.
- Restore Database: If you previously backed up a DataHog database, you can upload it to restore your data.
Depending on how many files are being scanned, the import process can take a few minutes to complete. Some extremely large directories (millions of files) may take much longer, and may currently be too large for the system to handle because of network timeouts; feel free to close the tab and check up on it later if you wish.
You can hover over a file with your mouse to get a link that copies its full path in the data store. The path can then be used to examine or delete the data using DE, iCommands, etc.
Once the import process for your first file source is complete, you will have access to 4 tabs:
- Summary: View a summary of each of your file sources, including various file rankings and visualizations.
- Browse Files: Explore the folder structure for each of your file sources, or search your files using names, regular expressions, or date and size filters. Each column header can be clicked to sort the table by that value.
- Duplicated Files: View a list of files with identical contents. By default, this page uses checksums to compare files, but file sizes or names can also be used. Each column header can be clicked to sort the table by that value.
- Manage File Sources: Import a new file source, remove an existing one, or download a backup of the current file database.
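Grouping files by checksum, size, or name, as the Duplicated Files tab describes, can be sketched as follows. This is a minimal illustration for local files, not the app's own code: the function names are hypothetical, and SHA-256 is an assumption, since the checksum algorithm DataHog uses is not stated here.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files are not read at once.

    SHA-256 is an assumed choice for this sketch.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, key="checksum"):
    """Group paths that share the same checksum, size, or file name.

    Returns a dict mapping each shared value to its sorted list of paths;
    values that occur only once are left out.
    """
    key_functions = {
        "checksum": file_checksum,
        "size": os.path.getsize,
        "name": os.path.basename,
    }
    groups = defaultdict(list)
    for path in paths:
        groups[key_functions[key](path)].append(path)
    return {
        value: sorted(found)
        for value, found in groups.items()
        if len(found) > 1
    }
```

Comparing by checksum finds files with identical contents regardless of name. Comparing by size or name is cheaper but can report files that merely look alike.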
Chris Klimowski (UA Data Science Institute: Data7)