lifeisstillgood 5 years ago

As someone who generally finds "couple million rows? Sqlite is fine" works as a philosophy, what generally should I know here?

I get that there are huge timeseries based data flows going round the world (transactions, metrics, weather, self driving telemetry) and sticking those into something is useful.

I understood the world has settled on "use hdfs to store the data, use apache spark to run over those files and reduce your query data set till it is small enough to fit into pandas and then process"

This looks like it is trying to cut out the first two steps?

  • gopalv 5 years ago

    > what generally should I know here?

    The "real useful" part was a click further in for me.

    https://github.com/BTrDB/smartgridstore/blob/master/tools/c3...

    Ingress endpoints, which are usually the more annoying component, rather than the exact storage format.

    > As someone who generally finds "couple million rows? Sqlite is fine"

    There are still unsolved annoyances with analog timeseries measurements, which aren't easily handled in SQL with columns + a btree index.

    So combining an airplane altitude sensor at 0.1Hz with a 10Hz fuel sensor starts giving you trouble in SQL, because the question someone might ask is "when the plane was at 31,000 feet, at 0.85 mach, what was the fuel rate", while the plane might have gone from 30,999 to 31,001 feet without ever recording the intermediate point.

    Indirectly speaking, you can do range-scans for each measurement if you treat time as something different from a generic column value in a generic SQL engine (i.e. join rows by temporal proximity).
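
    As a rough sketch of that temporal-proximity join outside any database (made-up sensor frames and pandas' merge_asof, nothing to do with BTrDB's internals):

      import pandas as pd

      # Hypothetical streams: fuel rate sampled at 10Hz, altitude at 0.1Hz.
      fuel = pd.DataFrame({
          "time": pd.to_datetime(list(range(0, 600_000, 100)), unit="ms"),
          "fuel_rate": 2.0,
      })
      alt = pd.DataFrame({
          "time": pd.to_datetime(list(range(0, 600_000, 10_000)), unit="ms"),
          "altitude_ft": [30_900.0 + 2 * i for i in range(60)],
      })

      # Attach the nearest-in-time altitude sample to every fuel sample,
      # since the two streams never share exact timestamps.
      joined = pd.merge_asof(fuel, alt, on="time",
                             direction="nearest", tolerance=pd.Timedelta("5s"))
      near_31k = joined[(joined.altitude_ft - 31_000).abs() < 50]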

    The ASF proposal for IoTDB [1] might make interesting reading on the topic (the frequencies part), as might the Boeing talk from XLDB 2018 [2].

    [1] - https://wiki.apache.org/incubator/IoTDBProposal

    [2] - https://conf.slac.stanford.edu/xldb2018/sites/xldb2018.conf....

    • SahAssar 5 years ago

      > So combining an airplane altitude sensor at 0.1Hz with a 10Hz fuel sensor starts giving you trouble in SQL, because the question someone might ask is "when the plane was at 31,000 feet, at 0.85 mach, what was the fuel rate", while the plane might have gone from 30,999 to 31,001 feet without ever recording the intermediate point.

      I thought that was part of the problem that postgres range types solves? https://www.postgresql.org/docs/current/rangetypes.html

      • dboreham 5 years ago

        Type is one thing. Efficient querying (which implies some appropriate index structure) is another.

        • vog 5 years ago

          PostgreSQL allows for all of that: user-defined types with user-defined index structures and whatnot. PostgreSQL is great and flexible, yet transaction-safe and fast. You should really try it. It is one of the highest-quality FLOSS projects I have ever seen, not just in terms of code quality, but also in terms of project and community management.

      • scottlocklin 5 years ago

        It is hilarious seeing this problem over and over and over again when it was solved in the APL language family ... some time in the 1960s.

        • i_don_t_know 5 years ago

          Could you explain how this is solved in APL?

          • scottlocklin 5 years ago

            I did a startup solving this problem (yet again) in an APL; we sucked at enterprise sales, but our data structures went rather faster than BTrDB. The most famous example of solving this problem is Kx Systems. Less famous, but basically using the same ideas (these guys all worked with Art and the Iversons): http://marketgridsystems.com/

            And no, I won't explain it in detail. I consider the Spark team basically cretins, and making solutions to this problem available for free would ruin several of my pals' businesses while transferring wealth to hedge funds, exchanges and advertising groups. I'll continue to snigger up my sleeve at the fact that decades of open source development still haven't noticed things which were obvious in the 1960s.

            Edit to add: a great example of this kind of thing, if not this exact thing: some poor guy trying to solve a 25TB problem using Spark and Parquet files ... finally solving it with ... Awk. Old tools written when 1MB was a lot of memory (sed, awk, tr, cut): vastly better than new tools.

            https://news.ycombinator.com/item?id=20293579

            • i_don_t_know 5 years ago

              I guess this is another instance of Alan Kay's "computer science is pop culture". We don't remember our history because reinventing it is so much fun.

reitzensteinm 5 years ago

The nodes of the tree resemble Clojure's HAMTs, just skipping the hash. It's interesting, because Clojure itself opts for RB trees when building sorted maps, but for keys that can be cast to a bit sequence (strings, integers), AMTs are a pretty good option.
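
For anyone who hasn't run into AMTs: the trick is just slicing the key's bits into fixed-size chunks and using each chunk as a child index, so ordering is preserved for integer-like keys without hashing. A minimal sketch (the 4-bit fanout and names here are mine, not Clojure's or BTrDB's):

  BITS = 4                       # 16-way fanout per level
  MASK = (1 << BITS) - 1

  def slot_path(key, depth=8):
      """Child-slot indices for an integer key, most significant chunk first."""
      return [(key >> (BITS * level)) & MASK for level in reversed(range(depth))]

  # Nearby keys share a long prefix of slots, so ordered iteration falls out
  # of the trie structure for free -- no hashing, no per-node comparisons.
  print(slot_path(31_000))   # [0, 0, 0, 0, 7, 9, 1, 8]
  print(slot_path(31_001))   # [0, 0, 0, 0, 7, 9, 1, 9]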

My side project is in some respects quite close to BTrDB; tree based analytics. Stumbling across it made my day, because it appears to have vindicated some design choices I was unsure about.

I've reimplemented Clojure's data structures in Clojure, and made the nodes serialize to a garbage collected KV backend (ardb or Redis depending on requirements). This means that they can persist between program runs, be sent over a network, and don't need to fit in memory. Using structural sharing, you can take a big hash map, add a single key to it, and have both the old and the new side by side.

The next piece is cached map/filter/reduce, which saves intermediate results on each node, turning batch jobs into incremental computations. You can take a big hash map on disk (my main one contains ~20gb, a map of string->set of data points), add some more values to it, and recalculate the analytics of the new map in a few milliseconds, having both side by side.
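
Not the actual code, but the cached-reduce idea looks roughly like this: every node stores the reduction of its subtree, a persistent insert only rebuilds one root-to-leaf path, and only the cached values along that path get recomputed:

  from dataclasses import dataclass
  from typing import Optional

  @dataclass(frozen=True)
  class Node:
      left: Optional["Node"]
      right: Optional["Node"]
      value: int = 0         # leaf payload
      cached_sum: int = 0    # reduction of the whole subtree

  def leaf(v):
      return Node(None, None, v, v)

  def branch(l, r):
      total = (l.cached_sum if l else 0) + (r.cached_sum if r else 0)
      return Node(l, r, 0, total)

  def insert(node, key, v, depth=8):
      """Persistent insert: returns a new root, sharing untouched subtrees."""
      if depth == 0:
          return leaf(v)
      bit = (key >> (depth - 1)) & 1
      l = node.left if node else None
      r = node.right if node else None
      if bit == 0:
          return branch(insert(l, key, v, depth - 1), r)
      return branch(l, insert(r, key, v, depth - 1))

  t1 = insert(None, key=3, v=10)
  t2 = insert(t1, key=200, v=5)          # t1 is untouched and still usable
  print(t1.cached_sum, t2.cached_sum)    # 10 15

Swap the in-memory child references for keys into the KV backend and that same shape is what persists between runs.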

The ultimate goal is to enable you to write more or less idiomatic Clojure programs that behave the same whether you're reducing a map with a thousand entries, or a trillion (e.g. by firing up a bunch of AWS Lambda instances).

  • lifeisstillgood 5 years ago

    This sounds fascinating - is your side project open anywhere?

    • reitzensteinm 5 years ago

      Not yet. I'm using it in prod for my own analytics, so I know it works, but it'll need some clean up and features before release. I do plan on it though.

roskilli 5 years ago

Those interested in TSDBs might also be interested in M3DB.io, which is Apache 2 licensed and supports a modified TSZ compression implementation with an 11x compression ratio, replication of data in the cluster, streaming of data between nodes on add/remove, and a Kubernetes operator.

  • sa46 5 years ago

    The compression technique to achieve the 11x compression ratio is pretty clever and is detailed in the Gorilla TSDB paper. Timestamps are encoded with a delta of delta encoding scheme which allows 96% of all timestamps to compress to a single bit within a compressed block.

    Gorilla paper: https://cs.brown.edu/courses/csci2270/papers/gorilla.pdf
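
    The timestamp half is easy to see with a toy example (the arithmetic only; the real encoder packs these into the variable-length bit ranges described in the paper):

      # Regularly sampled timestamps, with one sample arriving slightly late.
      ts = [1000, 1010, 1020, 1030, 1041, 1051]

      deltas = [b - a for a, b in zip(ts, ts[1:])]          # [10, 10, 10, 11, 10]
      dod    = [b - a for a, b in zip(deltas, deltas[1:])]  # [0, 0, 1, -1]

      # Gorilla writes a delta-of-delta of 0 as a single '0' bit, so steadily
      # sampled series spend roughly one bit per timestamp; only the irregular
      # samples need the wider encodings.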

    • roskilli 5 years ago

      It is definitely a huge step forward for time series data collection at scale; great props to the authors.

      The major difference between M3TSZ and vanilla TSZ is that instead of XORing the value component of each datapoint, M3TSZ checks whether the precision of the float value fits within a few significant digits; if so, the value is turned into an int representation, and delta-of-delta encoding of those ints from value to value is used to represent the changes rather than just XORing the float bits. It also tracks how many significant digits are in use and dampens the need to re-encode a change in that from value to value.

      We should write up the specific algorithm differences for the value component encoding.

      You can see the relative code here in "writeIntVal": https://github.com/m3db/m3/blob/b25e111aab06fb7cde9f7e1be8b6...
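
      Roughly, as a toy paraphrase of the above (not the actual M3 code): scale values that fit in a few significant digits to ints, then delta-of-delta encode those ints:

        values = [21.45, 21.46, 21.47, 21.49]
        digits = 2                     # tracked per stream, re-encoded only when it changes
        ints = [round(v * 10 ** digits) for v in values]      # [2145, 2146, 2147, 2149]

        deltas = [b - a for a, b in zip(ints, ints[1:])]          # [1, 1, 2]
        dod    = [b - a for a, b in zip(deltas, deltas[1:])]      # [0, 1]
        # Mostly-zero delta-of-deltas encode into very few bits; values that fail
        # the significant-digit check fall back to XORing the raw float bits.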

  • tfigment 5 years ago

    Thanks, I learned something new. It's not general purpose and has some strict limitations. As usual this seems to skip concepts like sensor quality/status (maybe qNaN can be used as a sort of sentinel value), but it is an interesting newer entry in the field.

galacticdessert 5 years ago

Unfortunately the website is a bit light on details, but I found some interesting information elsewhere:

https://blog.acolyer.org/2016/05/04/btrdb-optimizing-storage...

https://pingthings.io/platform.html

It looks very interesting, and I am a bit surprised to see how a tree structure can deliver such high performance in retrieving the data. Querying a long period of time means that each top node has to be walked to the bottom, right? Isn't that fairly expensive?

The aggregations at node level are also interesting. It is something the company I work for (https://www.energy21.com) also does for calculations, but we use a relational database and a notification-based system for performing the recalculations and keeping the aggregations up to date.

Cool stuff.

someguy13 5 years ago

You should also look at pingthings.io, which is a commercialization of BTrDB

bin0 5 years ago

I read about the kdb+ database in a post the other day as "the fastest" database, and the numbers given did seem incredibly impressive. How does this compare?

  • marknadal 5 years ago

    I wish everybody would just follow Redis' setup for baseline tests:

    https://redis.io/topics/benchmarks

      > This is an example of running the benchmark in a MacBook Air 11" using a pipelining of 16 commands:
      $ redis-benchmark -n 1000000 -t set,get -P 16 -q
      SET: 403063.28 requests per second
      GET: 508388.41 requests per second
    
    Testing against a MacBook Air is reasonable, as most developers will have access to one, or to a Windows machine with similar specs.

    For instance, I have a 5 year old Macbook Air, and I run this test to benchmark (baseline) my database:

    https://htmlpreview.github.io/?https://github.com/amark/gun/...

    My / Redis test is not very interesting (not a real-world use case), but it at least gives a baseline that removes a lot of variables.

    Then every database should implement Jepsen-like tests for everything else. That is what we're trying to do, by creating a Jepsen-like distributed systems correctness and load testing tool (https://github.com/gundb/panic-server) in a language that is easy for any developer to access and use.

    • pritambaral 5 years ago

      > ... as most developers will have access to one or a windows machine of similar stats.

      That's a bold assumption. Looking around my office in India here, I find not a single Macbook, a Windows machine, or any machine from '11. It would also be trouble to get my hands on a Macbook Air '11, if I had to do it for the purposes of reproducing a benchmark. I certainly couldn't buy one today, and even if I were to buy a currently-in-stores Macbook it'd be awfully expensive.

      A much better proposition would be something far more ubiquitous and accessible, something that can be borrowed trivially or bought easily by the vast majority of people, something that stays stable and available for a number of years. An AWS instance or a Raspberry Pi (1/2/3), for instance.

      • marknadal 5 years ago

        Thank you for calling this out. It was the spirit of what I was trying to go for, because the website uses "60GB of RAM (EC2 c3.8xlarge)", which I wouldn't even pay for.

        I totally agree with the RasPi suggestion. Though note, the RasPi 3 B+ is actually about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore.

        • pritambaral 5 years ago

          > 60GB of RAM (EC2 c3.8xlarge)" which I wouldn't even pay for.

          The point isn't to restrict benchmarks to only high-end machines. The point is to use a stable standard that people wouldn't have trouble running for a few hours. I can rent a machine of that identical type for 20 hours, or buy an RPi, for the low price of $35.

          > about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore

          Core counts and clock speeds do not translate to comparable computing power across processor families and instruction sets. E.g., the RPi 4 (1.5GHz, quadcore) is announced to deliver 3x the performance of an RPi 3B+ (1.4GHz, quadcore), and that's on the same instruction set.

        • philjohn 5 years ago

          > Though note, the RasPi 3 B+ is actually about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore.

          I don't think that's an apples-to-apples comparison. The Pi 3 doesn't have an out-of-order CPU, so it can't exploit much instruction-level parallelism. The new Pi 4 is better in this regard, but both are handily beaten by Intel CPUs from the last 10 years.

mzs 5 years ago

Berkeley Tree DB uses GPLv3, go figure

https://github.com/BTrDB/btrdb-server/blob/master/LICENSE

  • ksec 5 years ago

    So even Berkeley has abandoned BSD? Or is GPLv3 a requirement of the U.S. Department of Energy ARPA-E program (DE-AR0000430) or National Science Foundation CPS-1239552?

    • cat199 5 years ago

      'Berkeley' never used only the BSD license.

      BSD UNIX used the BSD license (among other things).

      There are, were, and always will be many projects, departments, professors, etc. doing things that need a license at UCB, and each will choose the appropriate one for their circumstances.

fnord123 5 years ago

Is this in-memory like bdb? It says it uses HTTP and Cap'n Proto, which would indicate that it's not in-memory.

sushilari 5 years ago

Are use-case-driven DBs helpful? Is the use case and possible adoption crowd large enough to sustain a product this critical down the road?

I worry about the proliferation of infra level tools or adopting a new one. Why not make an extension or fork postgres to demonstrate your concept? (Just an example)

  • unchurched 5 years ago

    Timeseries databases have an entirely different set of technical challenges from relational databases. Also, I wouldn’t exactly call this proliferation. There are relatively few decent timeseries databases in existence, some of which have been around since 1999.

    So yes, timeseries databases are incredibly helpful if you have to store, well, large amounts of timeseries data quickly.

  • cat199 5 years ago

    > large enough to sustain a product this critical down the road?

    This may not be a 'product' per se.

    also, these libraries tend to be small and self-contained and change little over time. See also:

    https://svnweb.freebsd.org/base/release/12.0.0/lib/libc/db/c...

    for a view into the original berkeley db as still incorporated in freebsd base (BSD Berkeley DB v1; not the later lineage by same authors which led to sleepycat / oracle Berkeley DB).

    > Why not make an extension or fork postgres to demonstrate your concept?

    this isn't that kind of database.

    this is a k/v store optimized around a particular use case, k/v stores being what was more traditionally considered a 'database' before the term 'database' colloquially implied 'relational database management system' (RDBMS - e.g. a system to manage your k/v database files relationally..)

    well, technically 'database' is any 'base of data'. but.. yes. 'base of data' -> 'k/v store' -> 'rdbms'

    See also:

    isam,dbm,bdb,cdb,gdbm,tdb,tokyo cabinet,kyoto cabinet,etc.

    conceivably, if desired, one would take this library and integrate it into a higher level application, whether that be a custom application datastore, or a new backend for something like postgres.

ashish01 5 years ago

> precomputed statistics over power-of-two-aligned time ranges

A simple Google search didn't turn up many details. Can someone please add some details or links for this?

  • hinkley 5 years ago

    Does this blurb from the about page cover you?

    > BTrDB uses a K-ary tree to store timeseries data. The leaves of the tree store the individual time-value pairs. Each internal node stores associative statistics of the data in its subtree; currently, the statistics are the minimum value, mean value, maximum value, and number of data points. BTrDB uses the internal nodes to accelerate processing of these statistical aggregates over arbitrary time ranges.
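
    To make the "accelerate" part concrete, a toy version of the query side might look like this (the node layout and field names are made up; count/sum only, since mean is just sum/count). Only the two boundary paths of the tree ever get walked:

      def query(node, start, end):
          """(count, sum) of points with start <= t < end, using cached stats."""
          if node is None or end <= node.t_min or node.t_max <= start:
              return (0, 0.0)                    # disjoint: skip entirely
          if start <= node.t_min and node.t_max <= end:
              return (node.count, node.sum)      # fully covered: O(1) from the node
          if node.children is None:              # leaf: scan the raw points
              pts = [v for t, v in node.points if start <= t < end]
              return (len(pts), float(sum(pts)))
          parts = [query(c, start, end) for c in node.children]
          return (sum(c for c, _ in parts), sum(s for _, s in parts))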

    • oofabz 5 years ago

      So basically, a high tech RRDtool?

      • hinkley 5 years ago

        I wasn’t satisfied by the density of information on that site. In fact that paragraph was the closest thing to information I found on the whole site.

        Pretty much the sort of thing my coworkers would try to pass off as “documentation” that someone else would then have to do over for them...

marcrosoft 5 years ago

Please change the logo. It is an obvious rip-off of Sequel Pro.

  • bourgoin 5 years ago

    It's the same meme - a stack of pancakes as a visual pun on "software stack"/"full stack." Not the only place I've seen that before, either - I've seen the same idea on T-shirts and laptop stickers.

    Funnily enough, I went to sequelpro.com/legal and found that while their software is under the MIT license, as opposed to the more restrictive GPL license used by BTrDB, the website specifically calls out the icon as property of the project, and its use is disallowed in forks. So it looks like they really would care about the logo being ripped off more than anything.