It'd be cool to put a simple NBD front end on it with an nbdkit plugin. That'd let you trivially turn the immutable objects into Linux devices or use them as backing for qemu virtual disks. (https://libguestfs.org/nbdkit-rust-plugin.3.html)
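Very rough sketch of what that plugin could look like, modeled on the ramdisk example in the linked man page. Treat all of it as assumptions: I'm writing the nbdkit crate's trait from memory, and the blob path is a made-up stand-in for wherever the cache materializes an object on local disk.

```rust
// Hypothetical nbdkit plugin exposing one cached immutable object as a
// read-only NBD device. The path below is invented for this sketch; a real
// plugin would take it as a config parameter.
use nbdkit::*;

struct CachedObject {
    // Whole blob loaded at open time; fine for a sketch, not for huge objects.
    data: Vec<u8>,
}

impl Server for CachedObject {
    fn name() -> &'static str {
        "cached-object"
    }

    fn open(_readonly: bool) -> Result<Box<dyn Server>> {
        // Assumption: the cache has already materialized the blob locally.
        let data = std::fs::read("/var/cache/blobs/my-immutable-object")
            .expect("cached blob should exist"); // sketch only: real code would map this to an nbdkit error
        Ok(Box::new(CachedObject { data }))
    }

    fn get_size(&self) -> Result<i64> {
        Ok(self.data.len() as i64)
    }

    fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<()> {
        // Immutable object, so serving a read is just a slice copy;
        // nbdkit bounds-checks requests against get_size for us.
        let ofs = offset as usize;
        buf.copy_from_slice(&self.data[ofs..ofs + buf.len()]);
        Ok(())
    }
}

// Only read callbacks are registered; writes aren't implemented in this sketch.
plugin!(CachedObject {});
```

qemu could then attach to the export over nbd:// and treat the immutable object as a disk.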
Check out ZeroFS (https://www.zerofs.net), which uses SlateDB (https://slatedb.io/).
Edit: Now I catch your drift; it would indeed be cool. ZeroFS requires a commitment to the SlateDB LSM data format.
https://old.reddit.com/r/databasedevelopment/comments/1nh1go...
s2 is one of the coolest technologies that more people should be talking about. I'm still begging them to move one layer lower and turn s2 into an incredible middleware for edge IoT deployments!
PLEASE, if someone from the team sees this: I would pay so much for an ephemeral object store using your same edge protocol (seen in the sensor example from your blog).
Cheers!
Hi mertletee, I'd like to understand the request better. Mind dropping me an email? It's in my profile.
This looks super interesting for single-AZ systems (which are useful, and have their place).
But I can't find anything to support the use case for highly available (multi-AZ), scalable, production infrastructure. Specifically, a unified and consistent cache across geos (AZs in the AWS case, since this seems to be targeted at S3).
Without it, you're increasing costs somewhere in your organization: cross-AZ networking costs, larger cache sizes in each AZ for availability, extra compute and cache-coherency costs across AZs to keep the caches in sync, and so on.
Any insight from the authors on how they handle these issues in their production systems at scale?
Not the author, but: it's a user-side read-through cache, so there's no need for pre-emptive cache coherence as such. There will still be a performance penalty for fetching data under write contention, irrespective of whether you're single-AZ or multi-AZ. The only way to mitigate that penalty is accurate predictive fetching that matches your usage patterns.
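Roughly, "user-side read-through" means something like the sketch below. All the names and types here are invented for illustration, not this project's actual API (which layers memory over disk rather than using a single map), but it shows why immutability removes the need for coherence traffic:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Blob = Arc<Vec<u8>>;

struct ReadThroughCache {
    // Stand-in for the real hybrid memory + disk tiers.
    local: Mutex<HashMap<String, Blob>>,
}

impl ReadThroughCache {
    fn get(&self, key: &str, fetch_from_store: impl Fn(&str) -> Blob) -> Blob {
        if let Some(hit) = self.local.lock().unwrap().get(key) {
            // Hit: served locally, no object-store or cross-AZ traffic.
            return hit.clone();
        }
        // Miss: pay the object-store round trip once, then keep the blob.
        // Because blobs are immutable, a cached copy can never go stale,
        // so no coherence protocol between instances or AZs is required.
        let blob = fetch_from_store(key);
        self.local.lock().unwrap().insert(key.to_string(), blob.clone());
        blob
    }
}

fn main() {
    let cache = ReadThroughCache { local: Mutex::new(HashMap::new()) };
    // Hypothetical fetch; in reality this would be an S3 GET.
    let fetch = |key: &str| Arc::new(format!("bytes of {key}").into_bytes());
    let first = cache.get("bucket/obj-1", fetch);  // miss: goes to the store
    let second = cache.get("bucket/obj-1", fetch); // hit: served locally
    assert!(Arc::ptr_eq(&first, &second));
}
```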
Assuming the "Designed for caching immutable blobs", I guess the approach is to indeed increase the cache size in each AZ or eat the cross-AZ networking costs.
Yes, that's how we're running it at s2.dev: auto-scaled per-AZ deployments. https://www.reddit.com/r/databasedevelopment/comments/1nh1go...
Not the author, but my suggestion is to use a real infrastructure provider. You will save tons of money.
Can someone explain when this would be a good solution? We currently store loads of files in S3 and ingest them directly on demand in our Java app API pods. Seems interesting if we could speed up retrievals, for sure.
The basic tradeoff is that you pay an extra tax on every request that isn't served by the cache, so something like this helps when you read the same data repeatedly. A database built on object storage is a good example.
If you can host the cache on-site, it'll likely benefit many things just by virtue of not going through a slow internet link.
Frankly, any web app I develop has configurable in-memory caching built into it, so I would rather increase its size than add an extrinsic cache. By keeping my cache internal to my application, it's also easier for me to invalidate keys accurately.
It's about scalability. If you have 100 instances, you really want them to share the cache so you increase the hit rate and keep egress costs low.
> If you have 100 instances you really want them to share the cache
I think that assumes decoupled compute and storage. If instead I couple compute and storage, I can shard the input (quick sketch below), and then I don't need to share the cache across instances. I don't think there is one approach that wins every time.
As for egress fees, that is an orthogonal concern.
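Something like the following is what I mean by sharding the input. It's just generic hash routing, nothing specific to this project, and a real deployment would likely use consistent hashing so that adding or removing instances doesn't invalidate every local cache:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a key to a fixed instance so that instance's local cache sees all
/// repeat reads for it; no shared cross-instance cache needed.
fn owning_instance(key: &str, num_instances: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_instances
}

fn main() {
    // Requests for the same object always land on the same instance.
    let a = owning_instance("bucket/object-1", 100);
    let b = owning_instance("bucket/object-1", 100);
    assert_eq!(a, b);
    println!("bucket/object-1 -> instance {a}");
}
```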
Let's go!
An interesting approach to caching with hybrid memory and disk and support for any S3-compatible backend, but limitations may arise when working with large data streams and specific backends.
Thanks ChatGPT