Re: Beyond another cloud: data service discovery for NDSLabs


Hi Arthur,

Again, I apologize for not replying sooner, but I'm trying to spend some time digesting these detailed topics (particularly ones others have thought about in much more depth!) before wading in.

Let me just say that my answer to the final question you pose, "Is this appropriate?", is definitely yes, though perhaps, as I'll explain below, for different reasons than others might have.

On Mon Nov 03 2014 at 9:26:41 AM Arthur Smith <apsmith@xxxxxxx> wrote:
Matt,

 thanks for such a thoughtful response. I certainly don't know myself exactly what will work, but the focus on answers to specific questions in data handling sounds to me like a more fruitful path.

 To explain a bit where I'm coming from, I've recently been involved in the FORCE11 data citation implementation group (DCIG):

https://www.force11.org/datacitationimplementation

and in particular with those looking at what it means for a dataset identifier to be "machine-actionable," as stated in principle #4 of the "joint declaration":

"A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community."


This really bears on your third question, "where can selections of data be obtained". But maybe it feeds into the others as well.

The question that came up in the DCIG was: suppose you have an identifier for a dataset. What can an automated system reliably do from that point? Suppose the user's goal is to load some selection of the data into an application - what would it take to do that with the least amount of further user interaction, starting from just the identifier?

One consensus seemed to be that the identifier (a DOI, say) should resolve to a landing page that provides (for appropriate queries) some form of metadata about the dataset. There's an (in-progress) Google spreadsheet the group has been working on that compares the different dataset metadata standards out there:

https://drive.google.com/folderview?id=0B-3fjDTO3dDaRlJWSzZFYlJUZTg&usp=sharing

So there is some hope that metadata of this sort can be expected to be available after resolving a dataset identifier.

Machine actionability then to me could imply:
  * application (an interoperability layer, not necessarily the final application for analysis) accepts an identifier (e.g. a DOI) from some sort of user interaction, for example following a citation
  * application resolves the DOI and receives metadata
  * application could display title, date, size, format, etc. to the user and ask for confirmation
  * application may download data from the "file location" and unpack it (to multiple files - a zipped "BagIt" format was suggested, or maybe OAI-ORE)
  * --- but - then what? It needs to know something more about the data to do anything useful with it. If there is some way for the data files to be linked to something like the RDA data type registry, then maybe at least some of it could be pulled in directly to the application. Otherwise further user interaction would be needed to select specific files, columns, etc. - but that's making it hard for the user again. (A rough sketch of the flow up to this point follows the list.)
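In rough Python, the flow up to that sticking point might look something like this (assuming the DOI resolver supports DataCite-style content negotiation, which dx.doi.org does for DataCite/Crossref DOIs; the example DOI and endpoint details are purely illustrative):

    # Sketch of the "machine actionable" flow, up to the "but - then what?" step.
    import requests

    def fetch_dataset_metadata(doi):
        """Resolve a DOI to structured (CSL JSON) metadata via content negotiation."""
        resp = requests.get(
            "http://dx.doi.org/" + doi,
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
        )
        resp.raise_for_status()
        return resp.json()

    meta = fetch_dataset_metadata("10.5061/dryad.example")  # illustrative DOI
    # The application could now show title, publisher, date, etc. and confirm:
    print(meta.get("title"), meta.get("publisher"))
    # ...then download from the "file location", unpack the (BagIt?) archive,
    # and then -- the open question -- map the files to something like the
    # RDA data type registry before handing them to an analysis tool.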

I agree with this completely, and I think that the "but - then what?" item is perhaps the most interesting one. I'd rather we *not* provide an answer to this from the perspective of the infrastructure within which applications can run, but instead define a matchmaking system for data.

(If I could bluff, I would pretend that this is what I had in mind when I brought up service discovery ... :)

What do you think about combining service discovery with the Datatype Registry for matchmaking applications to data? I'd rather we supply the ability for applications to fail than try to cover every possible aspect of their success. As a concrete example, imagine that applications get spawned, and they register themselves as working with a given datatype; data gets inserted into the system and, either during a tiling step or as part of the ingestion, it's identified as fitting into a given datatype from the DTR. When the data is selected to be acted upon, the available services would be returned. In addition to this, we could provide standard services as well -- generic Python, R[Studio], shell, etc., data manipulation methods.
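To make the matchmaking idea concrete, here's a minimal sketch (the datatype identifiers and endpoints are invented, and the registry itself could just as well live in etcd):

    # Applications register against datatype IDs (imagine these resolving to
    # RDA Data Type Registry entries); data gets tagged with a datatype at
    # ingest; lookup returns whatever matches. An empty match falling back to
    # generic tools is the "allow applications to fail" case.

    registry = {}  # datatype id -> list of service endpoints

    def register_service(datatype_id, endpoint):
        registry.setdefault(datatype_id, []).append(endpoint)

    def services_for(datatype_id):
        # Fall back to the generic manipulators (Python, R[Studio], shell).
        return registry.get(datatype_id,
                            ["generic-python", "generic-rstudio", "generic-shell"])

    # An app spawns and declares what it can consume (identifiers made up):
    register_service("dtr:csv-timeseries", "http://apps.local/plotter")

    print(services_for("dtr:csv-timeseries"))  # -> ['http://apps.local/plotter']
    print(services_for("dtr:unknown-blob"))    # -> the generic fallbacks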

This problem has essentially been solved in some communities, like the IVOA, with things like the VOTable standard:

http://www.ivoa.net/documents/VOTable/20130920/REC-VOTable-1.3-20130920.html

(along with a lot of other components). Maybe it would be helpful to start from there (or another example) and generalize?

Perhaps, but I'm not sure we should approach it from precisely the same perspective. I would like to see if we can come up with an environment where individuals from both the data and application sides can bring their own components, and then allow that to work itself out.

Is this an interesting idea?

  Arthur

(PS for everybody except Matt, if this isn't appropriate for this general mailing list please let me know!)


On 11/1/14, 2:06 PM, Matthew Turk wrote:
Hi Arthur,

Sorry it's taken me a few days to reply, but I've been pondering your email for a while and trying to formulate a response to it.

I think perhaps what I've been trying to get at in thinking about NDS Labs, and in trying to spur the conversation, is figuring out the best possible way to foster interoperability -- specifically, what can we do, now, to create an environment in which to explore and experiment? I originally thought we could do this by providing:

 * Communication mechanisms
 * Gradual growth and incubation of components that are connectable
 * Simple start, complex end

Perhaps, though, approaching this from "service discovery" is the wrong way. Adding more indirection and reimplementing things that have been done before are both tricky, as you point out, and probably best avoided for the time being.

Service discovery in general may simply be too *big* a sandbox for NDSLabs, where individual projects and instances will likely number a handful, not hundreds, and where the N^2 process of developing interoperability is going to be relatively small. If we can standardize generic k-v pairs but can't manage service discovery without them, we are probably doing something wrong! :)

So let's try what you suggested, and hammer down on what the specific, difficult technical things are that we want to do, and then figure out how to implement them.

Here are the things that, for a "next gen epiphyte", I know I would want:

 * What are the possible applications I can send input data to?
 * Where can the resultant data be sent?
 * Where can selections of data be obtained?

I'd also like to provide as much as possible in the simplest technology available -- which means that OAI will be a goal, but not the only conduit for data.

For epiphyte, we're also taking the tactic that (for now) the data is all in the form of files on disk. This won't work forever, but it will for the time being. If we were going to take those three "services," I think I would want to know a host/port, and some format for the transmission of data -- perhaps even just REST POSTing, or a URI to a filesystem (or filesystem-like thing). If we had that information, it might be enough to provide some degree of interop at that level, with a basis of understanding we can build more complex ideas on top of later.
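In code, the minimal interop level I'm imagining might be as thin as this (the host, port, and paths are all invented for illustration):

    # A "service" at this level is just a host/port plus a declared transport;
    # moving a file to it is a single REST POST. Endpoints are invented.
    import requests

    service = {
        "name": "epiphyte-ingest",
        "host": "10.0.0.5",
        "port": 8080,
        "transport": "rest-post",  # or a URI to a filesystem-like thing
    }

    def send_file(service, path):
        url = "http://{host}:{port}/data".format(**service)
        with open(path, "rb") as f:
            return requests.post(url, files={"file": f})

    # resp = send_file(service, "output/dataset.h5")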

Does that help to focus the ideas more? Or is this still complexity in search of need?

-Matt

On Thu Oct 30 2014 at 2:17:34 PM Arthur Smith <apsmith@xxxxxxx> wrote:
That does sound interesting. However, it also reminds me of RFC 1925:

http://tools.ietf.org/html/rfc1925

in particular "(6a) It is always possible to add another level of indirection." and perhaps #11 as well... Lots of wisdom in the old IETF...

I really liked your talk about what you'd done with Epiphyte - in particular making hard things easy. Very impressive work. Is there some way to organize this by starting from the "hard" use cases NDS Labs is trying to address, and drilling down to the technology components really needed to make those happen? Discovery does seem likely to be a good part of it, but if it's based on key-value pairs (for example), how does the user know what keys to query, and who sets the standards for those keys and the meanings of the corresponding values - to say nothing of knowing where exactly the etcd server (or whatever is doing that work) is? There's got to be some base starting point, a system that knows enough to help the user do things; can we work from there?


  Arthur



On 10/30/14, 11:42 AM, Matthew Turk wrote:
Hi all,

In the other thread, Arthur brought up that we don't want "just another cloud infrastructure," which I think was really apt, and something that deserves thought for any NDS Labs project. So I wanted to start a couple of topics about what can be provided on top of a standard cloud infrastructure that might be of use.

I'm wondering about discovering data services within a region, where that region is either some subnet on a cloud provider, or even more globally across locations. If we are thinking about interoperability of services, then there are probably a few verbs that could be identified as being necessary. If we can have services identify themselves as providing verb endpoints, that could provide an environment for testing interop.
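As a strawman for what "identifying verb endpoints" might look like, each service could announce a record along these lines (the verb names and fields here are invented):

    # Strawman self-description a service might publish at registration time.
    announcement = {
        "service": "catalog-slicer",
        "endpoint": "http://10.0.0.7:9090",
        "verbs": {
            "select": "/data/select",   # serve selections of data
            "accept": "/data/ingest",   # receive input data
            "emit":   "/data/results",  # hand off resultant data
        },
    }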

Kacper and I have been experimenting with this ourselves, mostly looking at the various service discovery mechanisms that operate on docker containers being orchestrated across machines. Some of these do this via introspection, and some will even set up automatic (nginx) reverse proxies for docker containers running inside a system. Right now it looks like etcd is a pretty good solution for this:
https://github.com/coreos/etcd

as it can allow for key/value pairs to be stored, and it's discoverable. For instance:
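(a rough sketch against etcd's v2 HTTP key/value API; the host, port, and key names are just illustrative)

    # Register a service under a well-known key prefix, then discover it by
    # listing that prefix. This talks to etcd's v2 HTTP API directly.
    import requests

    ETCD = "http://127.0.0.1:4001"  # etcd's default client port (at the moment)

    # A service announces itself; the TTL lets dead services expire:
    requests.put(
        ETCD + "/v2/keys/services/ingest/epiphyte",
        data={"value": "http://10.0.0.5:8080", "ttl": 60},
    )

    # A client discovers everything registered under /services/ingest:
    nodes = requests.get(ETCD + "/v2/keys/services/ingest").json()["node"]["nodes"]
    for n in nodes:
        print(n["key"], "->", n["value"])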
I think having a discussion about what we want services to be able to do is perhaps a much bigger topic, but I wonder if this type of thing -- particularly etcd -- would be useful to any projects, and would be a good avenue for service discovery and interop. Is there something else that would be better?

-Matt



