Re: Beyond another cloud: data service discovery for NDSLabs

On 11/5/14, 11:43 AM, Matthew Turk wrote:

[...] I'd rather we *not* provide an answer to this from the perspective of the infrastructure within which applications can run, but instead determine a matchmaking system for data.

Wow. That's making a ton of sense - and resonates completely with that email from the RDA datatype registry group (did they really just send that out earlier this week?) It looks to me like the "About" and "Scope" pages at http://typeregistry.org were updated recently too, along these lines?

What do you think about combining service discovery with the Datatype Registry for matchmaking applications to data? I'd rather we supply the ability for applications to fail than try to cover every possible aspect of their success. As a concrete example, imagine that applications get spawned, and they register themselves as working with a given datatype; data gets inserted into the system and, either during a tilling step or as part of the ingestion, it's identified as fitting into a given datatype from the DTR. When the data is selected to be acted upon, the available services would be returned. In addition to this, we could provide standard services as well -- generic Python, R[Studio], shell, etc. data manipulation methods.

So I'm imagining something similar to the way MIME types and associated applications are registered with web browsers right now. For each content type, I as a user have a default application to open it in (if I've seen that type before), but also other options available that I can select from or change to. Perhaps the DTR could be at root an extension of the content-type system? Except we're imagining the data handling applications registered not with the local web browser but through some sort of online discovery service. But I might want to add some more local ones of my own (like the Python, R, etc. examples). Still not entirely clear to me how this ought to work, but it seems like there should be a way to get there.
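
To make the browser analogy a bit more concrete, here is a rough Python sketch of the kind of lookup I have in mind. Everything in it is an assumption for illustration: the `HandlerRegistry` class, the datatype identifier, and the handler names are all made up, and a real discovery service would be online rather than an in-memory dictionary.

```python
from collections import defaultdict


class HandlerRegistry:
    """Toy stand-in for a discovery service plus per-user settings (hypothetical)."""

    def __init__(self):
        # datatype id -> handler names, kept in registration order
        self._handlers = defaultdict(list)
        # datatype id -> the handler this user prefers
        self._defaults = {}

    def register(self, datatype, handler):
        """An application (remote or local) announces that it handles a datatype."""
        if handler not in self._handlers[datatype]:
            self._handlers[datatype].append(handler)

    def set_default(self, datatype, handler):
        """Record the user's preferred handler, registering it if needed."""
        self.register(datatype, handler)
        self._defaults[datatype] = handler

    def options(self, datatype):
        """All known handlers for a datatype; empty list if none are registered."""
        return list(self._handlers.get(datatype, []))

    def default(self, datatype):
        """The user's choice if set, else the first registered handler, else None."""
        if datatype in self._defaults:
            return self._defaults[datatype]
        opts = self.options(datatype)
        return opts[0] if opts else None


registry = HandlerRegistry()
# Made-up datatype id and handler names, standing in for DTR entries and services.
registry.register("example/gridded-field", "remote-viewer")
registry.register("example/gridded-field", "local-python")
registry.set_default("example/gridded-field", "local-python")
```

The point of the sketch is only that "default plus a list of alternatives" is a small amount of state per datatype, and that local handlers can be registered through the same call as online ones.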

I really like this idea, and I think it blends very well with what RDA is trying to come up with. So to drill down from this idea - what are the technology components we need?
* A DTR is one piece (perhaps organized a bit differently from the RDA example as it stands).
* Some kind of discovery service to link applications with data types they support.
* Another service to link datasets or particular portions of datasets (individual files, ?) to data types (or some way to represent that in the dataset metadata?) I'm thinking something based on the Open Annotation model perhaps?
* An interoperability layer that can link up a dataset with a default application, either online or local, or present a list of options, through the above services.
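
As a very rough illustration of how the last two pieces might fit together, here is a Python sketch. It is only loosely inspired by the target/body idea in the Open Annotation model and does not follow that specification; the `TypeAnnotation` record, the `dtr:` identifiers, the dataset paths, and the application names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeAnnotation:
    target: str  # a dataset, or a part of one such as a file path (hypothetical)
    body: str    # identifier of a datatype in the registry (hypothetical)


def types_for(target, annotations):
    """Datatype ids attached to a target or its parents, most specific first."""
    matches = [
        a for a in annotations
        if target == a.target or target.startswith(a.target.rstrip("/") + "/")
    ]
    # Longer target paths are more specific, so they sort to the front.
    matches.sort(key=lambda a: len(a.target), reverse=True)
    return [a.body for a in matches]


def resolve(target, annotations, handlers):
    """Interoperability step: candidate applications for a target, without duplicates."""
    candidates = []
    for datatype in types_for(target, annotations):
        for app in handlers.get(datatype, []):
            if app not in candidates:
                candidates.append(app)
    return candidates


annotations = [
    TypeAnnotation("dataset-1", "dtr:table"),
    TypeAnnotation("dataset-1/run.h5", "dtr:gridded"),
]
# Stand-in for what a discovery service would return per datatype.
handlers = {"dtr:gridded": ["viewer"], "dtr:table": ["rstudio", "viewer"]}

print(resolve("dataset-1/run.h5", annotations, handlers))
print(resolve("dataset-1/notes.txt", annotations, handlers))
```

In this sketch a file inherits the datatypes of its enclosing dataset unless something more specific is attached to it, which is one possible answer to the "dataset or portion of a dataset" question, not the only one.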

This doesn't sound too overwhelming... Are there other pieces needed?

Arthur