The iPlant Foundation API v1.0


The iPlant Foundation API has been replaced by the Agave API. Agave is the next generation of the Foundation API, providing complete feature coverage, better performance, a consistent URL structure, better security, and a multitude of new features and capabilities. For complete documentation on the Agave API, including tutorials, live documentation, sample data, and client libraries, please visit the Agave documentation site. For more information, please contact the iPlant team.





  1. Matt, thanks for posting this. As we began discussing in the CI meeting, there is a question about the underlying functionality that isn't addressed by this API doc but that could, when addressed, impact the API. Getting back to the basic need for the API, it was my understanding that it would serve to provide HTTP access to iPlant file storage. Previous discussions led to a decision that user-owned data would be added to iRODS and then replicated to another iRODS instance. Since the DE is hosted at UA, we've used the UA iRODS as the write location and the TACC instance as a mirror. To date, we've relied upon the features of iRODS to handle transfers and synchronization and have written services as an abstraction layer that encapsulates Jargon (the iRODS Java API). This has been in place for the DE since last spring.

    I realize that the DE isn't all of iPlant and that we can't just make all files start at UA just because the DE lives here.  Still, we do need to ensure the integrity of the data.  So, my question is this.  How does the API ensure that a person writing to a file on the UA iRODS via this API (rename, etc.) doesn't end up with a conflict with changes to the same file made by someone writing through the TACC instance of the API? 

    One solid pattern for handling this is to create a master service (or proxy, as Nirav called it). This service would be the single point of access for requests and delegate to the appropriate instance of a lower-level service to perform the task. In this pattern, there can be only one master.

    As a note, any changes made to files at TACC won't be propagated to the UA instance of iRODS until the configuration changes are made to make that happen. As Edwin pointed out, the current UA iRODS propagates changes to TACC, but the reverse isn't true. Another note is that until we have a reliable means to ensure that the files stored at TACC are consistent with what is stored at UA, we won't be able to execute analyses at TACC. Retrievals for viewing files come from the UA iRODS due to its proximity to the DE application. If we don't resolve the data consistency issue before running analyses at TACC, we could end up with a scenario where a user modifies a file through the TACC instance of the API, another user uses that file to run an analysis at TACC, and then that same user views the file through the DE, where they would see something different from what was used for the analysis.

    1. This is a good question...   At the moment, I believe Rion's reference implementation is in fact pointing to the UA iRODS instance (and then things should get mirrored to Corral), so we don't have an immediate issue with testing this.  

      We do need to settle the long term strategy here, however... and the most immediate question is does this impact the API syntax?  We can always change the storage layer under the covers without affecting the API, *unless* we think we want to expose something about a multi-site structure in the API semantics themselves.  My first thought is that we probably don't, and this won't impact the API spec, but I haven't thought it through thoroughly...   Other thoughts?

      1. Using the UA iRODS as the primary storage target for the foundational API seems like a reasonable plan for the medium term. As long as everyone agrees that the UA iRODS is the master storage for the /iplant/collections path, conflicts shouldn't happen due to a multi-site structure. I also don't see a good reason why we should expose a multi-site structure in the API syntax.

        Incidentally, iRODS has the ability to select the storage resource for data based on criteria (e.g. collection/path, owner, file size, etc.), but we have not implemented this across sites. Utilizing this functionality in iRODS would seem like a good approach, particularly as we increase the use of iRODS across iPlant's different applications/services.

        Independent of the foundational API, the one concern I have for the long term – the concern echoed in Sonya's comment – is if/when we get to a point where a subset of data is stored at TACC because UA does not have the petabyte storage capacity that TACC currently has. We will need to identify the issues surrounding this storage disparity, such as when and where we should partition the data and how applications will deal with a multi-site structure.

        Maybe this can be discussed as we develop the data management policy?

        1. I'll follow up on Edwin's and Rion's comments by agreeing, essentially, that we don't want to have two separate "instances" of the API writing to the same part of the namespace either at TACC or UA; the model should be that we have one primary data store location and one replica location that is completely managed by the iRODS component for each instance - beyond that it doesn't matter where it's located, either physically or within the directory hierarchy. Otherwise you're just reimplementing replica management on top of a replica management system.

          Long-term, I would like to have a single iRODS instance with replication of the ICAT (database) for the data management, and not trying to deal with different zones or managing replicas at the data API level will facilitate that overall scheme. This single iRODS instance will need to live partially on that separate master node with significant, fast storage that Rion mentions, although it would have at least one offsite backup system, and we'll still want the underlying implementation to be able to handle multiple masters and have some failover capability for the worst-case scenarios.

  2. Distributed data management is always a pain. There's never a really good general way of handling it; it always just boils down to understanding your usage pattern and QoS goals and architecting a solution that will meet or exceed your best-guess estimates. The good news is that the Foundational API implementation (which I'll just call the api out of laziness) isn't concerned with the underlying irods implementation at all. The api interacts with irods through Jargon and leaves the data sync integrity issues to a lower level of the architecture. This is a pattern that will carry through all the api services. IMHO, we don't want users to bother with system details. It's more manageable over the long term, as data, compute, networking, and instrumentation resources come and go, to keep things abstracted from the user. I feel kind of silly saying that because I know I'm preaching to the choir. I just keep reminding myself that the goal is for the user to look at the I/O service and see an online storage box with a lot of value-added features (and supporting services) that make conducting and disseminating their science easier. I'd be worried that we had missed the mark if they instead saw it as another irods front end. I think we have a shot at being a first-rate iplant api and, at best, a second-rate irods api.

    About the sync issues. Sonya is right about the proxy, though we may need to figure out how to handle a situation where the proxy goes down. I also think we need a master irods server outside of TACC and UA with sufficient space to cache the data that comes in and out until it can be propagated across the primary and backup servers. While a very lightweight VM running mod_proxy could serve as the proxy, I vote for a server with 5-10 TB of SSD to serve as the irods master, pushing to corral as the primary and UA as the redundant server. If either the primary or master goes down we can just rotate roles until the missing server comes back online. As I understand it, irods can handle this through policy and some micro services with a bit of additional massaging on our part. Edwin and the wonderful Dr. Jordan would be the ones to elaborate on this, though.

    Just a couple bullet point notes because this is getting too long:

    • If the DE uses the api for its data interactions, there is no problem with locality of data.
    • I vote for TACC to be the primary irods instance not because I work there, but because that's where the hpc systems we're using are located. Thus, that's where any science of reasonable size will be done. If we were using Abe and MSS at NCSA, I'd say use their storage system. Even if I weren't thinking of the larger architectural picture here, the last thing I want to do as someone trying to get good throughput and good performance in my experiments is make it policy to store data away from the compute resources I'm using. The best thing that could happen is everything running slower.
    • Did I mention we should have a separate irods master node with a bunch of disk?
    1. I agree that there is no one good way to handle distributed data management, but there are well-known best practices, design patterns, and standards that, when applied, do make it easier than starting from scratch. If I am understanding correctly (combining responses from Dan and Rion), the intent is that there be one instance of the I/O API pointing to UA iRODS and that replication and synchronization be handled by iRODS. If this is the case, given that it is doing basically the same thing internally that our existing services do (i.e. interacting with iRODS via Jargon), then as long as the API matures to the point that it addresses the concerns that our existing API does (e.g. better security) and those that we are in the process of improving (e.g. more standards-compliant MIME type handling), we may be able to replace the services we currently use with these. I would prefer a less conversational API for the purpose, but if we co-locate the single instances of these services with our DE deployed at UA, then the conversational interaction does become less of an issue. I'll leave it to other reviewers to discuss the details of what is already implemented so the gaps in this API can be identified.

      On the Master Data Management topic of which storage to make master and which to make secondary: the relevant best practice is that the copy that is most susceptible to modifications should be the master, and secondaries should strive to be read-only. Since the vast majority of our users will modify their user data through the DE client application, and since writes made within the DE are more reliably made to co-located data instances (since network failures outside of the local network of the DE application don't come into play), it is only logical that the UA storage be the master. Job execution will occur at both sites, but that is irrelevant since it is a read-only activity on the user data and requires writes on the generated data products only. It is much easier to synchronize data creation than modification.

      A suggestion for ensuring they don't just see it as another iRODS API would be to lift the level of abstraction. It is currently very technology driven, which is notorious for not being user friendly. This gets back to issues related to the conversational aspects but also has to do with understanding your user, what they are trying to accomplish, and making that easy.

      Rion's suggestion that we support three-copy replication is certainly a data center best practice. Which copy would be master really depends upon usage patterns, with writes being the most important to consider.

      And finally, I want to point out that the physical grouping of files that this API supports isn't needed by the DE. That's not to say that it won't be useful for other applications or that it would be a hindrance, but the DE uses logical file groupings (a best practice for applications that are part of enterprise systems), so it isn't likely to be of much help in exercising that aspect of the API.

      1. It's important to note here that the location of the "master" and the location of the storage are completely independent issues; if the DE hosting all happens at UA, there is a strong argument to be made that the master DB and iCAT instance (i.e. all technical and other metadata management) should live at UA; but it can control storage resources located at TACC just as easily as those at UA, and in an ideal world all the data will be replicated at both sites anyway. The unexplored territory here is how well this particular database replication might function over the WAN, but for the foreseeable future I don't see this being a huge issue.

        1. If there were a +1 or like button on Confluence, I would use it on Chris's post above. I also want to emphasize the fact that iRODS was designed for data federation through a single iRODS client interface. This means that it is not necessary for one iCAT/DB (i.e. site) to be the "master" of all the data; in other words, we can mirror most data, but we can also partition data to different site-specific storage repositories, if it made sense to do so.

  3. What about using a SAML token (via Shibboleth or just straight SAML) for the authentication instead of the user passing along their credentials? I realize that you'll want the username defined so that it can be used for building request URLs. And I know this is a reference implementation, but it appears that the user is offering up their credentials in plain text for anyone sniffing traffic (which is far easier these days). Maybe the "https" was skipped/omitted. I just wanted to bring it up.

    The other question I heard from Seung-jin was about the versioning of the API. To paraphrase Matt (Vaughn), we'll be lucky to get users to learn/use the API once. But if we break it, the community may not be forgiving. So we want to guard against breaking the API with future releases. This is a large topic, I know.

    1. Please note that I understand that ingesting resources, as noted above ("ie:"), will require the user to pass the information in plain text. But I would hope that we would treat their credentials more appropriately.

      1. Personally, I'm always reluctant to give my authentication information to a third party, so I'd be inclined to completely avoid uploading files from URLs where authentication is required. We might want to find some alternate ways of doing this that don't require the user to provide authentication information or to temporarily stage the file somewhere where authentication isn't required. Would it be reasonable to allow users to PUT files into the service? It would be necessary to accept multipart messages so that the service could accept the file metadata and the file contents, but it should be possible.

        1. Dennis - I think the usage scenario is that you have a sequencing facility that provides you a URL with your short reads (FASTA, FASTQ), and the design is indirect to avoid having to pull those files (& directories) back to your local machine. In other words: "Hey Discovery Environment, go grab my sequence data from here," and then the asynchronous data transfer takes over. But that's just my assumption. So that username/password might be a shared set of credentials, and I believe that's just using Basic or Digest, so that's not too strong.

          But for our API, I would definitely want to avoid compromising credentials.

          1. That makes more sense to me. If the credentials are shared anyway then providing them to a third party might not be as much of a concern. The scenario that I had in mind was that the user had a file in some repository in which the user's credentials to that repository were required to access the file. In that case, downloading the file to the user's computer in order to transfer it to iPlant's repository isn't optimal, but I considered it to be preferable to requiring the user to provide us with their credentials. On the other hand, people trust web sites with private information (for example, credit card numbers) all the time, so this might not be that big of a concern as long as the information is sent to our API over SSL.

            I'd still like to suggest allowing users to upload files directly into iPlant's repository, however; the ability to do that might still be useful if the files already exist on the user's computer.

    2. Good point; we'd definitely want to require the use of SSL for these services, especially if HTTP basic authentication is being used. Another reason to use SAML is that it provides a degree of separation between the user's credentials and the authentication information that is sent with every request. If we use SAML then the user's credentials only need to be sent to the identity provider; once the user has authenticated, only the SAML assertion (or the information needed to retrieve the SAML assertion) needs to be sent to the service. The benefit of doing this is that it limits the number of times that the user's credentials are sent over the wire, which reduces the number of opportunities for an attacker to obtain them. An attacker can still hijack a session by obtaining a SAML token, but the damage of such an attack can be mitigated by assigning SAML assertions short lifetimes. Of course, the risk of an attacker obtaining a SAML token can be reduced in the first place by signing and encrypting the assertion and using HTTPS.

      On a related note, HTTP basic authentication will not scale if we join a SAML federation. It would be infeasible (and defeat the purpose of joining a federation) to create an iPlant account for everyone in the federation who wants to use the API.

      1. Those are good points, Dennis. I mean, those benefits are generally what people want with the adoption of OpenID/OAuth. But OpenID & OAuth are not good candidates for us because they do not provide us with a level of trust about the person's identity. Shibboleth (with or without a federation) gives us a level of trust that a university has validated their identity. I know that Rion, Dennis, Sonya, and I talked about this back in Feb. of this year when we started looking to integrate tools between TACC & UA.

      2. @Dennis: The production version of this will use SSL. The current instance is just a development machine. As for adding more methods of authentication, I agree that we should in the future. But HTTP Basic gets us over the threshold today.

  4. I saw an early version of this and noticed that user names and passwords are being transmitted as plain text using http. I thought it was a typo. Is it, or are we really doing that?
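
    For context on why plain HTTP matters here: HTTP Basic credentials are only base64-encoded, not encrypted, so over an unencrypted connection they are effectively plaintext. A minimal sketch (username/password are made up):

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic is just base64("user:pass") -- an encoding, not encryption.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

def recover_credentials(header):
    # Anyone sniffing unencrypted HTTP traffic can reverse it trivially.
    encoded = header.split(" ", 1)[1]
    return base64.b64decode(encoded).decode().split(":", 1)

header = basic_auth_header("jdoe", "s3cret")
print(header)
print(recover_credentials(header))  # ['jdoe', 's3cret']
```

    Over HTTPS the header is protected in transit, which is why SSL is a hard requirement if Basic authentication stays.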

  5. I agree that sending user/pass in clear text is really unwanted.  I believe that using curl's "--digest" would help with this a bit.  My preference overall is to use https or saml for authenticating users.  Because files are being pulled from various sources, using clear text may be the only option (requirement of the source).

    Would it be possible to get read-only access to the git repository so that others can view the samples?

    I also foresee API versioning as a requirement so that we can deprecate the old but still allow users to use it for a time period while converting/upgrading to the new. If we force users to upgrade without deprecation, we are likely to lose them.

    It is my understanding that iRODS can be used to do multi-copy synchronization between geographic sites. I don't believe you have to specify a "master", just the sites it will sync with, thus allowing ALL sites to contain the same data when needed. Users would always be able to connect to the site closest to them and, should that be down, be able to connect to the others. Edwin, please correct me if I am wrong here.

    1. @Jerry: iRODS can synchronize collections in one direction much like the rsync unix command. To perform bidirectional synchronization, one would simply synchronize in the other direction.

      On a side note, it is important to understand that there are still limitations with iRODS. For example, we cannot inherently determine in-flight files (files that are in the process of being transferred). This presents challenges for applications and APIs that might be performing their own mechanisms to track in-flight files. One solution would be to create a master service/proxy, much like the one that Sonya suggests in an above comment, that could provide consistent mechanisms to deal with these limitations for all applications and services.

    2. @Matt and Jerry: Right now, because the authentication part is not done, we are passing username/password in plaintext. While this should be a supported course of action for the user, I agree that it's not the optimal behavior for us to encourage. It looks like a more mature approach to authentication will emerge from the discussions posted here.

      As for the code samples repository: I'm putting one together now. It was mentioned as a placeholder in the documentation.

  6. I just gave this a closer read than I did earlier (when I was looking more at documentation style and expression-related concerns rather than content). I don't have experience using RESTful services, so these comments come solely from my understanding via reading, which may be faulty.

    • A pattern I have seen often is the use of an API key that is generated and associated with a username. This allows for monitoring and policing (so we can produce the metrics for usage, and revoke the key in cases of abuse). It also means that someone acquiring the key does not compromise the user's credentials. I have never seen any suggestion anywhere that plain-text communication of user credentials is ever a good idea... think about things like FireSheep. We should take better care of our users and their accounts.
    • Perhaps using Content Negotiation via HTTP Accept headers might eliminate the need for the /data endpoints altogether.
      Downside - there aren't MIME types for the data we're dealing with... but we could invent them while following standards. Also, with Content Negotiation and MIME types, you can specify the format that you will accept in a response using an Accept header, like this:

    Here, the priority to each format is specified using q=. Acceptable values are between 0 and 1, with higher numbers having higher priority.
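
    The example itself was lost in the export, but it was presumably an Accept header along the lines of `Accept: application/xml, application/json;q=0.9, text/plain;q=0.5`. A rough sketch of how a service might rank such preferences (the MIME types are placeholders):

```python
def parse_accept(header):
    """Parse an Accept header into (type, q) pairs, highest priority first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        q = 1.0  # q defaults to 1 when omitted
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        entries.append((fields[0], q))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)

print(parse_accept("application/json;q=0.9, text/plain;q=0.5, application/xml"))
```

    The service would then serve the highest-ranked type it knows how to produce.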

    W3C has an excellent discussion of this at

    Ideally, conversations (multiple requests with multiple responses to complete a task) should not happen within an API, but rather single requests/responses. It seems that this method would help.

    I generally consider IBM's documentation to be quite good. This one is specific to one of their technologies, but the initial discussion may be valuable.

    Here is Apache's discussion of the topic:

    1. I think the notion of using an API key is handy, since you can have a scenario like this:

      Suppose I write a script against Flickr's RESTful API and I have an infinite loop that just wails on the Flickr server. This is likely to be seen as a Denial of Service attack. My API key could be deactivated. Flickr may still allow me to log in to my account, with an optional notice presented explaining why my API key was deactivated and contact information to follow up with.

      For us, we would deactivate the iPlant user's LDAP account. Our only recourse is to email the user at their email address from LDAP.

      Now, an API key might be helpful to enable us to do fine-grained monitoring. But you could likely do that by some data mining on the access logs, since the username is in the URL. So monitoring is an option for many RESTful APIs, but I think that's because they do not use the username - but the API key. There is a nice abstraction to having an API key instead of a username. Yet I would acknowledge this is an issue that can be debated.
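
      As a sketch of that log-mining alternative (the /io/<username> path layout and the log lines are hypothetical, standing in for whatever the real URL scheme is):

```python
import re
from collections import Counter

# Hypothetical access-log lines, assuming the username appears in the URL
# path (e.g. /io/<username>/...) as the comment above describes.
log_lines = [
    '10.0.0.1 - - "GET /io/jdoe/data/reads.fastq HTTP/1.1" 200',
    '10.0.0.2 - - "GET /io/asmith/list HTTP/1.1" 200',
    '10.0.0.1 - - "PUT /io/jdoe/data/out.txt HTTP/1.1" 201',
]

# Pull the path segment right after /io/ out of each request line.
pattern = re.compile(r'"[A-Z]+ /io/([^/\s]+)')
usage = Counter(
    match.group(1)
    for line in log_lines
    if (match := pattern.search(line)) is not None
)
print(usage)  # Counter({'jdoe': 2, 'asmith': 1})
```

      This gives per-user request counts without an API key, though a key would decouple monitoring from the username.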

    2. Oh - and I'm all for content negotiation. In reviewing other web service proposals (like PhyloWS), I noticed the absence of Content Negotiation. This is a really handy way to use HTTP to help the user specify their needs without adding to the number of services exposed (less maintenance coding, administration, etc.).

      1. On a similar note. I think it would be useful to use a custom MIME type in the Content-Type and Accept HTTP header fields to specify the format of content. A parameter could be used to designate the format version.

        Something like this:

        I'm doing that off the top of my head, so the syntax may be off. The intention is hopefully clear, though.
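
        Since the example itself was lost in the export, a guess at its shape: a header such as `Content-Type: application/x-fastq; version=0` (the type name is invented here). Extracting the version parameter is mechanical:

```python
def parse_content_type(value):
    """Split 'type/subtype; key=val; ...' into the MIME type and its parameters."""
    parts = [p.strip() for p in value.split(";")]
    params = {}
    for p in parts[1:]:
        if "=" in p:
            key, val = p.split("=", 1)
            params[key.strip()] = val.strip().strip('"')
    return parts[0], params

# 'application/x-fastq' is a hypothetical, unregistered type name.
mime, params = parse_content_type("application/x-fastq; version=0")
print(mime, params)  # application/x-fastq {'version': '0'}
```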

        If we're using SSL and SAML tokens to enable API access for certain users, then it would probably be okay to accept HEAD requests on content URLs. That would allow clients to quickly get information about a file without having to perform a full GET request and parse the JSON that gets returned.

        If it's not possible to use MIME types in the HTTP headers, then I would strongly recommend placing the format version in a separate field in the JSON rather than appending it to the name (as in 'FASTQ-Solexa-0'). I had to deal with parsing version information out of RPM filenames when I was doing maintenance work on the up2date backend at Red Hat, and it can get really annoying, really fast. We should spare API users from that pain.
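
        To illustrate the difference (the JSON record shape is hypothetical): parsing the version out of a composite name needs a regex and can be ambiguous if a format name itself ends in a digit, while a separate field is a plain lookup:

```python
import re

# Fragile: the version is embedded in the name and has to be parsed back out.
name = "FASTQ-Solexa-0"
m = re.match(r"(?P<format>.+)-(?P<version>\d+)$", name)
print(m.group("format"), m.group("version"))  # FASTQ-Solexa 0

# Robust: the version travels as its own field in the JSON record.
record = {"format": "FASTQ-Solexa", "version": 0}
print(record["format"], record["version"])    # FASTQ-Solexa 0
```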

          1. This is a good point, John. I would think we'd want to use a file format indicator that makes us a "good citizen" of the web. And I think we'd like to encourage the community to use MIME types. Rion mentioned that the FASTQ Solexa format doesn't have an acknowledged version, but it sounds like the format is subject to rapid change given how quickly UHTS machinery/technology is evolving (or would it be innovation?). But determining a MIME type to use would be something we'd want community feedback on. I understand that we want to promote the use of standards within the community, and making use of MIME types is a great way to promote this.

          I also agree that if the user is given back a listing of supported formats, that listing should be easy to parse and consume. Nearly all mainstream languages have a JSON library, so giving them the metadata/response in JSON would be helpful. I understand that this is more bytes over the wire. I just think this is a good compromise when you consider that the user will find it easier to use.

          Quick aside: I noticed that NCBI uses the MIME type 'chemical/ncbi-asn1-binary' [1] for structures. I have no clue if that would be valid for an amino acid representation. I just wanted to mention that NCBI is attempting to communicate file format by MIME type. So my comment appears to have some precedent.

          All that said, you could construct something like this:
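
          (Reconstructing the lost list loosely; every type below is a provisional, unregistered strawman except biotree/newick, which is discussed in the notes that follow.) One could imagine a mapping like:

```python
# Provisional x- type names for formats that lack registered MIME types --
# purely a strawman for community discussion, not established types.
format_mime = {
    "ClustalW alignment": "application/x-clustalw",
    "PHYLIP": "application/x-phylip",
    "Stockholm": "application/x-stockholm",
    "Newick tree": "biotree/newick",
}
for fmt, mime in sorted(format_mime.items()):
    print(f"{fmt}: {mime}")
```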


          1. I found an EMBL MIME type, chemical/x-embl-dl-nucleotide, but not one for an amino acid. The MIME type I have listed above is not recognized by any resource I found while searching.
          2. Regarding ClustalW, you could have an alignment (.aln) or guide tree (.dnd) produced. Given that we're talking about amino acids, I'm making the leap that the alignment file is the file format we're talking about [1].
          3. PHYLIP is one of the formats that presents a gap in that it does not appear to have a MIME type associated with its genetic file format. I believe the PHYLIP tree format can be classified as biotree/newick or whatever is used with Newick tree files.
          4. Stockholm format does not appear to have a MIME type; one might assign something like the above to it, or use something like application/x-hmmr, application/x-pfam, or application/x-rfam.

          Some reference:

          Is "ASN-0" here different than the ASN.1 file format standard? I'm only familiar with ASN.1 and the various representations that it allows (ASCII, binary, XML).

          1. @Andy and John: An early draft of the API (the one against which the VAPrototype webapp was built) explicitly supported transmission of content type and used MIME types. MIME type was a field in the file record returned by IO listings, and we transmitted a Content-Type header when the file was retrieved. In the situation where we don't have MIME types to match our data types (that's the majority case) we can at least transmit text/plain to give a browser or other consumer application a hint about the nature of the file. Since the majority of file types in the current collection of ~45 resolve to text/plain, we decided it was not super useful to include it in the current specification.

            @ASN-0: This is the definition for the ASN.1 format specification, which is used (annoyingly) by NCBI. I don't know if anyone else actually uses it in bioinformatics. I've gone >10 years without it.

            @Andy: I do like the idea of explicitly specifying the version number in the JSON. I think we had that in there and it may have just fallen by the wayside.

            1. @MattV - I understand that a large number of file types would end up as text/plain, but I was trying to point out that it would be beneficial if there were MIME types for those [1]. The short forms that are suggested for use will look like we're inventing our own file types, which I imagine we might take some heat for [2]. If there were MIME types, then the I/O services could use Content Negotiation for retrieval like so:

              The Accept header would provide the context for the requester's desired format and avoid having to make several service calls to determine the supported file format transformations and then, finally, a call to request that transformation. What does the code for the requester look like? A couple of calls, parsing responses, and several conditions to determine what to do next? Whereas using content negotiation, you make one call and then react to the format that you got back. The catch being that you may not get what you want, but you'll get what you need, in that the content is there in plain text.

              I understand that the above may make the implementation more complex, but it has the added benefit of making the user interaction more simple.
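
              A sketch of the server-side half of that trade-off, with hypothetical supported types and plain text as the guaranteed fallback:

```python
# Hypothetical list of representations the I/O service can produce;
# text/plain is the guaranteed fallback ("you'll get what you need").
SUPPORTED = ["application/x-fastq", "text/plain"]

def negotiate(accept_header):
    """Return the first acceptable supported type, else fall back to text/plain."""
    for part in accept_header.split(","):
        mime = part.split(";")[0].strip()  # ignore q-values for this sketch
        if mime == "*/*":
            return SUPPORTED[0]
        if mime in SUPPORTED:
            return mime
    return "text/plain"

print(negotiate("application/x-fastq;q=1.0"))  # one call, desired format
print(negotiate("application/pdf"))            # unsupported -> plain text
```

              The client makes one request and branches on the Content-Type it gets back, instead of probing for supported transformations first.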

              I understand most people, especially bioinformatics folks, hate ASN.1. NCBI's James Otell did defend why they picked ASN.1 in an article years back, and it would have made sense if the file format could have been abstracted away from the users so they didn't need to worry about ingesting and outputting it. I know there is no love for ASN.1, but the NCBI group had their reasons, even if most people disagree.

              [1] - My understanding is we want to foster "standards" usage among the community. I know this slows the process down. But I would imagine that we could provide "suggestions" and work with the groups responsible for the file formats and see if they agree or have other ideas. I just feel that the fact that such MIME types don't exist yet is not a reason to adopt something that is different from a common approach on the web. I know the short-form ID is used in the /data endpoint URLs - but I'm just trying to put out another approach that would eliminate all but the async transform request (and a callback URL in the I/O might even eliminate that).
              [2] - I still get heat for services that I didn't even create while at The services were not compliant w/ general web conventions and invented their own conventions.

              1. I have no trouble supporting Accept headers, but they are merely suggestions about what to do. If they are our only method for content transformation, we're trading an imperative model for a declarative one. But all the other functions of this API are imperative - make a directory, delete a folder, upload, chmod, etc. This is because we're providing commands that present a type-aware file system over REST.

    3. Dennis, please correct me if I'm wrong. But with a SAML token, we still have the chance that someone using a tool like FireSheep could hijack a token. The time-to-live on a token would help in minimizing the damage done by this. With Basic Authentication, we can minimize the damage done - but that depends on the user changing their password. And they may not know their credentials have been compromised, so that is less likely to happen.

      1. True. We can reduce the risk of a user hijacking a session by always using HTTPS. Assuming the encryption keys aren't compromised, the most an attacker can do is replay packets if we're using HTTPS. The time-to-live attribute of the SAML assertion reduces the risk of replay attacks by limiting the amount of time that replayed packets will work.
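
        A sketch of the expiry check that enforces that short lifetime (the field name is borrowed from SAML's NotOnOrAfter condition; the times are illustrative):

```python
from datetime import datetime, timedelta, timezone

def assertion_is_valid(not_on_or_after, now=None):
    """Reject a (possibly replayed) token once its expiry instant has passed."""
    now = now or datetime.now(timezone.utc)
    return now < not_on_or_after

# Illustrative times: a 5-minute lifetime bounds the replay window.
issued = datetime(2011, 1, 1, 12, 0, tzinfo=timezone.utc)
expiry = issued + timedelta(minutes=5)

print(assertion_is_valid(expiry, now=issued + timedelta(minutes=2)))   # True
print(assertion_is_valid(expiry, now=issued + timedelta(minutes=10)))  # False
```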

        1. I just wanted to acknowledge that I'm not assuming an encrypted SAML token is 100% secure. I agree - putting that over HTTPS + SAML token is a solid option.

    4. Just a quick thought: some other RESTful APIs I've used also ask for info about the tool making the request (e.g., NCBI's eutils). Mandating this might allow tracking of third-party tools, so you can see how the service is being used and know which developers to contact regarding problems. Of course, this could be spoofed, but even easily-spoofed things like a website visitor's user agent are useful in aggregate. Some APIs also mandate creation of a developer key for use (e.g., Google Maps).

      But I'm a biologist, so please feel free to ignore this post if it's a bad idea -- don't want you to waste your time explaining something that's obvious to those with more experience in this area.

      1. Brian, definitely a good idea. I'm interested in metrics and accounting - Rion and I are talking about something like this for a point release in the future.

  7. I'm trying to utilize this API through a browser, but I'm getting Access-Control-Allow-Origin errors (the server is not set up to allow cross-domain AJAX requests). Can this be turned on? You just need to add the header 'Access-Control-Allow-Origin: *' to responses to allow access from all domains.
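
    The change is just one response header; as an illustration (sketched in Python WSGI terms, which is not necessarily the service's actual stack):

```python
# Illustration only: adding 'Access-Control-Allow-Origin: *' to every response,
# shown as a WSGI middleware wrapper.
def allow_all_origins(app):
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Append the CORS header alongside whatever the app already set.
            headers = list(headers) + [("Access-Control-Allow-Origin", "*")]
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped

# Tiny demo app and a fake start_response to show the header being added.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b"{}"]

captured = {}
def fake_start_response(status, headers, exc_info=None):
    captured["headers"] = headers

body = allow_all_origins(demo_app)({}, fake_start_response)
print(captured["headers"])
```

    Note that '*' opens the API to every origin; once authentication is tightened up, a whitelist of origins would be safer.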