Author: vmx

Created: 2023-03-20

Idea

Onboarding useful data to Filecoin is still a work in progress. One such large, useful dataset is the Sentinel data from the Copernicus project. This is satellite imagery collected by the European Space Agency (ESA), similar to NASA’s Landsat mission.

Last week I talked to Markus Neteler, an expert in the field whom I’ve known for many years. I asked him how he’d like to access the data in an ideal world. Currently he gets it from Google Cloud, as the data there is in better shape (more on this later) and faster to get than from the official sources. Still, he said it could be even better, and I think that’s what we should be doing.

Current state

If you’re working with Sentinel data today, you’d typically get it from AWS or Google Cloud. It’s just more convenient and faster than getting it from the official sources. The ESA has done atmospheric corrections on the data since 2018. The catch is that they change the pixel value scaling parameters from time to time. But when you do analysis with the data, you want the same parameters for all the images in your time series. Google does that for you: they have processed the data so that the whole archive uses the same parameters. (N.B. NASA always reprocesses their entire Landsat archive whenever they change the parameters, so they are much friendlier to their downstream users.)
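To make the scaling problem concrete, here is a minimal sketch. Sentinel-2 L2A products store reflectance as integer digital numbers (DN), and with ESA processing baseline 04.00 an additive offset was introduced, so the decode formula changed mid-archive. The exact numbers below are assumptions based on ESA’s public release notes, not something from this document:

```python
# Illustrative sketch: the same stored pixel value decodes to a different
# reflectance depending on the processing baseline of the granule.
def dn_to_reflectance(dn, baseline):
    # ASSUMPTION: baseline 04.00 introduced a -1000 offset; earlier
    # baselines used reflectance = DN / 10000 with no offset.
    offset = -1000 if baseline >= "04.00" else 0
    return (dn + offset) / 10000.0

old = dn_to_reflectance(2000, "03.01")  # pre-offset archive
new = dn_to_reflectance(2000, "04.00")  # post-offset archive
```

A time series mixing both baselines without harmonization would see a spurious jump in reflectance, which is exactly why a uniformly reprocessed archive matters.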

Google provides the data as JPEG2000, but it would be more useful to have Cloud Optimized GeoTIFFs (COGs). We could provide those.
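The JPEG2000-to-COG conversion itself is well-trodden ground; GDAL (3.1+) ships a COG output driver. A minimal sketch of what each conversion step could look like, where the band filename and the DEFLATE compression choice are illustrative assumptions:

```python
# Hypothetical sketch: build a gdal_translate invocation that converts one
# Sentinel-2 JPEG2000 band into a Cloud Optimized GeoTIFF.
import subprocess

def jp2_to_cog_cmd(src, dst):
    """Return the gdal_translate command for a JP2 -> COG conversion."""
    return [
        "gdal_translate",
        "-of", "COG",               # GDAL's Cloud Optimized GeoTIFF driver
        "-co", "COMPRESS=DEFLATE",  # lossless compression
        src, dst,
    ]

cmd = jp2_to_cog_cmd("B04.jp2", "B04.tif")
# subprocess.run(cmd, check=True)  # requires GDAL to be installed locally
```

The actual pipeline would loop this over every band of every granule in the archive.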

Ideal world

Having an archive of the whole Sentinel dataset as COGs, all with the same atmospheric correction, accessible via HTTP. I can think of two scenarios for getting there.

Using the data from Google Cloud

Getting the data from Google Cloud, converting it to COGs, and dumping it into Filecoin. Apart from the image conversion, no further processing would be needed (except for atmospheric correction, if that’s desired).

Getting the data from the official source

This is the way I’d prefer we do it, as I think it aligns better with Protocol Labs’ overall mission. We start by getting the data from the ESA, then do the atmospheric correction and the conversion to COGs, and store the results on Filecoin.

The reason I prefer it is that we could get into reproducible/verifiable data (processing). If you download data from Google, you trust that it’s a copy of the original data, but you can’t be sure. If we create an open source processing pipeline right from the source, people can verify that the processing was correct.

How to do it

We start with the data from the ESA. We store those files directly on Filecoin. I don’t know whether the ESA provides checksums of their files, but we could create them ourselves. We then process those files and store the results on Filecoin.
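Creating our own checksums is straightforward; a minimal sketch using SHA-256 (the choice of hash and the demo file contents are my assumptions, not anything specified here):

```python
# Sketch: generate a SHA-256 checksum for a downloaded ESA file so the raw
# inputs stored on Filecoin can be tied back to the originals.
import hashlib
import os
import tempfile

def sha256_file(path):
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# tiny demo with a temporary file standing in for a Sentinel granule
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"sentinel granule bytes")
    tmp = f.name
checksum = sha256_file(tmp)
os.unlink(tmp)
```

A manifest mapping filenames to digests could then be published (and itself stored on Filecoin) alongside the raw archive.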

People could then verify the correctness locally: download a file from the ESA and generate a checksum, then follow our processing scripts, e.g. generate a CAR file, verify its CID, do more processing, and verify again.
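The verification loop can be sketched in a few lines. Real deployments would use IPLD/CAR tooling and CIDs; plain SHA-256 digests and a trivial stand-in transformation are used here purely for illustration:

```python
# Illustrative sketch of local verification: publish a digest alongside the
# raw file and alongside each processing output, so a third party can re-run
# the pipeline and compare digests at every stage.
import hashlib

def digest(data):
    return hashlib.sha256(data).hexdigest()

def process(data):
    # stand-in for a real, deterministic step such as COG conversion
    # or CAR packing (which would yield a verifiable CID instead)
    return data.upper()

original = b"raw esa scene"
published_raw_digest = digest(original)    # published with the raw file
processed = process(original)
published_out_digest = digest(processed)   # published with the processed file

# a verifier re-runs each step and checks the published digests match
ok = (digest(original) == published_raw_digest
      and digest(process(original)) == published_out_digest)
```

The key property is that every step is deterministic, so anyone with the source data and the scripts can reproduce the exact same digests.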

I think this verifiability is a huge thing, and we kind of get it for free thanks to content addressing. I even remember talking to someone from the academic community who needed to prove that the data they used for their analysis was based on the original satellite imagery. Back then they didn’t know how to prove it, other than simply asserting that it was the case.