At Wet Dog Weather, we process and store a lot of weather data. This includes using Zarr with AWS S3 to efficiently manage large, multidimensional arrays. We currently lean towards real-time weather display and query, but our customers are nudging us towards displaying medium- and long-term forecasts. The point is that anything we hold has to be immediately accessible.
Thus, our goal is that anything we hold, whether it be radar or a weather model, is accessible for display within tens of milliseconds. For query, we aim for something similar and recently achieved it with some work. That’s a future blog post.
Custom Data Formats
We really like Cloud Native data formats. These are standards for storing large datasets that can be accessed directly from block storage. In our case, that’s Amazon S3, but they all work in similar ways.
We approach this class of data formats from the inside. We’ve been doing this for years, but in a rather hack-y, non-standard way that we won’t bother to share.
One example is our fast-access DataPak format, which is used by our data tile servers. It’s just a zip file with a separate header that represents a data tile pyramid. When Terrier requests a specific data tile, retrieving it takes two GET calls.
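The DataPak internals aren’t something we publish, but the two-GET idea can be sketched with a toy format: an offset table sitting in front of the tile payloads, where the first GET grabs the table and the second grabs just the tile’s byte range. Everything below (the header layout, `fetch`, `build_pak`, `get_tile`) is a hypothetical illustration, not the real format:

```python
import struct

# Toy stand-in for a DataPak-style file: a count, then a table of
# (tile_id, offset, length) triples, then the tile payloads.

def fetch(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for one ranged GET (Range: bytes=start..start+length-1)."""
    return blob[start:start + length]

def build_pak(tiles: dict) -> tuple:
    """Pack tiles into a single blob: count, offset table, payloads."""
    header_len = 4 + 12 * len(tiles)
    entries, payload = b"", b""
    offset = header_len
    for tile_id, data in tiles.items():
        entries += struct.pack("<III", tile_id, offset, len(data))
        payload += data
        offset += len(data)
    return struct.pack("<I", len(tiles)) + entries + payload, header_len

def get_tile(blob: bytes, tile_id: int, header_len: int) -> bytes:
    """Two 'GETs' total: one for the header, one for the tile bytes."""
    header = fetch(blob, 0, header_len)                # GET #1: offset table
    (count,) = struct.unpack_from("<I", header, 0)
    for i in range(count):
        tid, off, length = struct.unpack_from("<III", header, 4 + 12 * i)
        if tid == tile_id:
            return fetch(blob, off, length)            # GET #2: just this tile
    raise KeyError(tile_id)
```

In practice the header GET can be cached, so repeated tile requests against the same pyramid approach one GET each.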
Now you could do the same thing with a ‘directory’ full of PNG files, but that would cost too much money to upload and generally be a maintenance nightmare on S3. If you know, you’re nodding along. If you don’t, well, that’s the kind of weird problem you run into with Cloud Native.
Using Zarr with AWS S3 for Weather Data
We transitioned to using Zarr with AWS S3 for data storage (but not display) a while ago and are generally satisfied with it. With version 3, we were able to significantly reduce our S3 Tier 1 request costs. What are those? They’re the per-request charges on uploads: you pay for every object you PUT into S3.
This is where things get interesting, and they never quite stop being interesting. Zarr is best suited to storing multidimensional arrays, and you can structure them however you like. Each of the ‘chunks’ you define becomes its own file, which can add up to a lot of files, and S3 doesn’t really approve of that. S3 expresses its disapproval through cost.
With Zarr 3, we can group those chunks into shards and avoid S3’s disapproving eye. In most cases, we create a single shard for the entire data file. That significantly reduces costs in our use case, one of the main benefits of using Zarr with AWS S3.
Using Zarr for Quick Access
This approach was ideal for data processing, storage, and a specific type of retrieval. Specifically, the kind where a customer requests a large chunk of one variable (e.g., wind speed) for a single time slice (e.g., 5:00 AM).
If the customer requests 200 forecasts at once for a single latitude/longitude, then things get interesting again. Accessing 200 files takes… a while. Often long enough to time out, so we had to restructure things.
To handle that, we added a time dimension and stored each variable (e.g., wind speed) in its own file. Now that giant query takes under 500ms.
Organizing these merge files is annoying given the way forecasts trickle in. I like to say, “If it were easy, no one would pay us to do it.”
Reflections on Zarr and S3
Having worked with AWS S3, weather data, and two distinct access patterns, I have some thoughts to share. They’re heavily domain-dependent, which I suspect is true of most Cloud Native solutions.
S3 doesn’t like a lot of small PUT operations. Anything you can do to avoid that may be worthwhile. We measure that need by cost.
We no longer write to S3 directly. We write our Zarr file to local storage and then upload it to the cloud. We’re staying in the GB range, so this might not work as well if you move into TB.
We still use different files for display. Zarr is great, but it takes about twice the number of GETs to accomplish what our goofy DataPak format does. That’s not because DataPak is superior to Zarr, far from it; it’s just that we know which corners we can cut for ourselves.
We use two distinct Zarr structures for processing and querying weather data. If we didn’t, it would be very messy and take a lot longer to produce the files. We’ve found that using Zarr with AWS S3 works best when we match structure to access patterns.
Final Thoughts on Using Zarr with AWS S3
What I love about using Zarr with AWS S3 is not that it’s super clever. The algorithms are straightforward, the implementation is solid, and it’s all very recognizable. If you’ve been working in this area for a while, it feels comfortably familiar.
The beauty of a Cloud Native format lies not in its cleverness, but in its standardization. It does something we’ve all done here and there, but it does it consistently. Then the proponents go out and talk a bunch of other packages into supporting it.
Zarr is not going to solve all your problems, but it does help with many of them. We’re keeping many of our specific solutions, but it’s nice to add using Zarr with AWS S3 to the mix.