Project Octostore: A Universal Metadata Store as a Service

Author: David Aronchick

Note: The following is a summary of a more thorough analysis.

Summary

Developers struggle to store information about the process of building applications. The lack of simple libraries, easy-to-use services, and universal tooling makes it challenging to build the automated, repeatable, and introspectable pipelines that are at the core of any modern development practice.

The open source community can address this problem and significantly accelerate developer productivity and velocity by offering a metadata-store-as-a-service (MDaaS) server, hosted offerings of that server, and an associated SDK. This service would enable storing information about development inputs, outputs, and workflows, which is currently “locked away” in source code and in overly verbose, hard-to-parse logs. The benefits to the user include: 1) central storage of metadata independent of any cloud, 2) acceleration of developers and their data science/analysis via an SDK, and 3) unlocking new and more reliable automation.

Problems We Are Addressing

The development process is full of reading and writing metadata – information ABOUT the process, not just the artifacts themselves. Metadata generally falls into one of three camps: INPUTS necessary to start a step, OUTPUTS describing the results of the step, and WORKFLOW describing how to execute each step.
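The three camps could be modeled as a simple schema. The class and field names below are illustrative assumptions sketched for this proposal, not part of any existing standard:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class InputMetadata:
    # What a step consumes: datasets, pinned library versions, parameters.
    datasets: list[str] = field(default_factory=list)
    library_versions: dict[str, str] = field(default_factory=dict)
    parameters: dict[str, Any] = field(default_factory=dict)

@dataclass
class OutputMetadata:
    # What a step produces: artifact locations and metrics describing them.
    artifacts: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)

@dataclass
class WorkflowMetadata:
    # How to execute the step and what it depends on.
    command: str = ""
    depends_on: list[str] = field(default_factory=list)

@dataclass
class StepRecord:
    # One recorded step, tying the three camps together.
    step_name: str
    inputs: InputMetadata
    outputs: OutputMetadata
    workflow: WorkflowMetadata
```

Keeping the three camps as distinct, typed structures is what makes later automation possible: a policy engine can query inputs, a dashboard can query outputs, and an orchestrator can query workflow, all from the same record.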

Unfortunately, this metadata is often effectively inaccessible: either lost entirely (due to a lack of proper logging) or, at best, buried in opaque blobs that require manual parsing. To enable proper use of the data about these processes, we need to build a simple SDK and a hosted store that make it incredibly easy to record, store, and reason about the process of building software.
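To make the idea concrete, here is a minimal in-memory sketch of what such an SDK might feel like. The class name, method signatures, and record fields are all hypothetical; the real service would persist records to the hosted store rather than a local list:

```python
import time

class MetadataClient:
    """Hypothetical sketch of the proposed SDK. A production client would
    send records to the hosted MDaaS service instead of holding them in memory."""

    def __init__(self):
        self._records = []

    def record(self, step: str, kind: str, payload: dict) -> dict:
        # kind is one of "input", "output", or "workflow".
        entry = {
            "step": step,
            "kind": kind,
            "payload": payload,
            "recorded_at": time.time(),
        }
        self._records.append(entry)
        return entry

    def query(self, step=None, kind=None) -> list:
        # Filter recorded metadata by step name and/or kind.
        return [
            r for r in self._records
            if (step is None or r["step"] == step)
            and (kind is None or r["kind"] == kind)
        ]
```

A typical call site would be a single line inside a pipeline step, e.g. `client.record("train", "output", {"auc": 0.91})`, which is the low-friction recording experience the proposal argues is currently missing.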

A real-world example comes from Microsoft building the Office suite. Office policy requires that only trusted versions of libraries be used when building ML models for customer-content scenarios, requires two code reviewers, and requires a manual approval process. For each model produced, there must be programmatic validation of what data is being trained and tested on, how it is split, what libraries were used, who reviewed and approved it, the history of all model experiments and their metrics, the cost-versus-performance trade-off at the time the model was run, that the model card was auto-generated, and so on. The team has spent hundreds of person-years manually writing libraries for accessing this information, validating policies, and maintaining their own ad-hoc services to provide programmatic validation. Worse, these tools impose a significant cognitive load on developers, who must remember to include the correct logging information. Even when the metadata is written to a useful location, an entirely new team must spin up its own work to pry the information out in order to take programmatic action on it.
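With metadata in a structured store, policy checks like the ones described above become short, auditable functions instead of per-team ad-hoc services. The sketch below is an assumption of what such a check could look like; the metadata keys, the trusted-library list, and the function name are all illustrative, not Microsoft's actual schema or policy engine:

```python
# Illustrative trusted-library allowlist; a real one would be policy-managed.
TRUSTED_LIBRARIES = {"numpy", "scikit-learn"}

def validate_model_metadata(meta: dict) -> list:
    """Return a list of policy violations; an empty list means compliant."""
    violations = []

    # Policy: only trusted library versions may be used.
    untrusted = set(meta.get("libraries", [])) - TRUSTED_LIBRARIES
    if untrusted:
        violations.append(f"untrusted libraries: {sorted(untrusted)}")

    # Policy: two code reviewers are required.
    if len(meta.get("reviewers", [])) < 2:
        violations.append("fewer than two code reviewers")

    # Policy: a manual approval must be on record.
    if not meta.get("approved_by"):
        violations.append("missing manual approval")

    return violations
```

The point is not this particular check but the shape of it: once the metadata is queryable, each policy becomes a few lines of code that any team can run, rather than hundreds of person-years of bespoke tooling.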

Or, in the words of Netflix:

[Image: quotation from Netflix]

Our Proposal

We can fix these issues by treating the problem holistically. METADATA, alongside SOURCE CODE and PACKAGING, is the third leg of the modern developer toolkit and should be treated just as seriously. We need a solution that allows any organization to spin up a standards-based metadata service that is deeply integrated with common developer tools, yet flexible enough to use anywhere (including other clouds and on-premises). A service focused on this would address all of the above issues and more, unlock a new world of developer productivity, and solve key enterprise requirements that commercial solutions leave unaddressed.