Context
We want to decouple schema-management concerns from Lily and improve our users' (including our own) ability to consume that same schema.
Problem Statements
- TODO: Add other related problem statements
- Lily's data model currently contains significant data duplication, which in turn leads to large tables.
- From @Forrest Weston:
For example, there are ~1 billion miner sector event entries in the current prod mainnet database, and each entry contains a state-root reference. A CID is ~64 bytes, so that table alone holds roughly 64 gigabytes of CIDs. Added to the rest of the tables that carry state roots, that makes at least 250 gigabytes of state-root CIDs alone. The same applies to message CIDs and block CIDs.
Proposed Solutions
- Fully-relational, read-optimized schema (see the sketch below)
- Schema-less Lily
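As a rough illustration of the fully-relational direction, here is a minimal sketch of deduplicating state-root CIDs into a lookup table. The package, table, and column names are hypothetical (not Lily's actual schema), and the DDL assumes Postgres.

```go
// Package migrations holds schema sketches; everything below is hypothetical.
package migrations

// NormalizeStateRoots deduplicates CIDs into a lookup table so that fact
// tables reference an 8-byte bigint instead of repeating a ~64-byte CID
// string on every row. Names are illustrative, not Lily's actual schema.
const NormalizeStateRoots = `
CREATE TABLE cids (
    id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    cid TEXT   NOT NULL UNIQUE  -- canonical string form, e.g. bafy2bzace...
);

CREATE TABLE miner_sector_events (
    height        BIGINT NOT NULL,
    miner_id      TEXT   NOT NULL,
    sector_id     BIGINT NOT NULL,
    event         TEXT   NOT NULL,
    state_root_id BIGINT NOT NULL REFERENCES cids (id)
);
`
```

At ~1 billion rows, swapping a ~64-byte CID string for an 8-byte reference would shrink that one column from ~64 GB to ~8 GB, which is the saving the problem statement above points at.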
Notes (just drop things here as they come to mind)
- https://github.com/polarsignals/arcticdb
- Some recent discussion in Toronto:
  - Schema validation of extracted data is a missing piece of our current backfill CSV extraction (see the sketch after this list).
  - Schema validation is necessary to preserve the guarantee that the intermediate format can be consumed by a well-specified serializer (and fed into some custom process).
  - CSVs have no opinion on schema and may admit data that violates constraints that would normally be expressed in a schema definition.
  - Plans to use some intermediate format
  - Importance of human readability of the intermediate format
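A minimal sketch of the validation step discussed above, assuming a hypothetical NDJSON intermediate format (newline-delimited JSON keeps the format human-readable). The row shape, field names, and checks are illustrative assumptions, not Lily's actual models.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// MinerSectorEvent is a hypothetical row shape for the intermediate format.
type MinerSectorEvent struct {
	Height    int64  `json:"height"`
	StateRoot string `json:"state_root"`
	MinerID   string `json:"miner_id"`
	Event     string `json:"event"`
}

// validate enforces constraints that a bare CSV cannot express.
func (e MinerSectorEvent) validate() error {
	if e.Height < 0 {
		return fmt.Errorf("height must be non-negative, got %d", e.Height)
	}
	// Crude stand-in for real CID parsing (e.g. cid.Decode from
	// github.com/ipfs/go-cid).
	if !strings.HasPrefix(e.StateRoot, "bafy") {
		return fmt.Errorf("state_root %q does not look like a CIDv1", e.StateRoot)
	}
	if e.MinerID == "" || e.Event == "" {
		return fmt.Errorf("miner_id and event are required")
	}
	return nil
}

func main() {
	// Read NDJSON rows from stdin; reject any row that violates the schema
	// before it reaches a downstream serializer.
	scanner := bufio.NewScanner(os.Stdin)
	line := 0
	for scanner.Scan() {
		line++
		var row MinerSectorEvent
		if err := json.Unmarshal(scanner.Bytes(), &row); err != nil {
			fmt.Fprintf(os.Stderr, "line %d: malformed JSON: %v\n", line, err)
			continue
		}
		if err := row.validate(); err != nil {
			fmt.Fprintf(os.Stderr, "line %d: schema violation: %v\n", line, err)
			continue
		}
		// Row is valid; hand it off to the consumer here.
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
	}
}
```

Rejecting bad rows at this boundary is what protects the guarantee that anything passing through can be consumed by a well-specified serializer.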