Show HN: Empirical – a language for time-series analysis

111 points by chrisaycock 5 years ago

I wrote Empirical to address issues I had routinely faced in my career.

I spent ten years in quantitative finance, primarily statistical arbitrage and high-frequency trading. I always ran into problems working with time-series data, from fetching the data to expressing the algorithms to watching my backtests fail from a type error after four hours.

As a result, Empirical has statically typed Dataframes and built-in timestamp types. It can infer the types from a file as long as the input source is known at compile time, such as in a REPL.

Today's release is the very first public beta. There is a ton of work to do still; see the roadmap for further details:

https://github.com/empirical-soft/empirical-lang/issues/1

I have released the source code under the AGPL with the Commons Clause. A proprietary license is available for users who need more commercial-friendly terms.

It was a long journey to get here. I would like to thank everyone who participated in the private beta. And lastly, I would like to thank Y Combinator; I won a Startup School grant last year.

hx2a 5 years ago

I like how SQL syntax is integrated into the language. That is very convenient.
Can you compare/contrast this language with q? Can I write complex queries as I can do in q? How about performance?
- chrisaycock 5 years ago
  
  q was definitely an inspiration.
  The biggest difference is that Empirical is statically typed. Also, arbitrary expressions are allowed anywhere when dealing with Dataframes.
  For example, I can sort by anything:
  sort my_table by col1 - col2, foo(col3)
  And I can aggregate by an external array as long as the lengths are the same:
  from my_table select foo(col1) by some_array_with_repetition
  As for performance, I've tried to make Empirical "reasonable" at this stage, though I haven't put too much effort into it beyond that. I'm more worried about the ergonomics of the language right now, so I don't have SIMD, dependency analysis, etc.
  One of the biggest things Empirical lacks compared to q is nested arrays. That's a major issue I have to tackle in the virtual machine.
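
To make the two queries above concrete, here is a rough Python sketch of what "sort by an arbitrary expression" and "aggregate by an external array" amount to. It is not Empirical's implementation; the table contents and the helpers `sort_by` and `aggregate_by` are hypothetical names invented for illustration, and `sum` stands in for the `foo` of the original example.

```python
from collections import defaultdict

# Hypothetical column-oriented table: one list per column.
my_table = {
    "col1": [5, 3, 9, 1],
    "col2": [2, 3, 4, 0],
}

def sort_by(table, keys):
    """Return a new table with rows ordered by a list of sort keys."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return {name: [col[i] for i in order] for name, col in table.items()}

def aggregate_by(values, groups, func):
    """Apply func to the values that share each group label."""
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: func(vals) for group, vals in buckets.items()}

# Sort by a computed expression (col1 - col2) rather than a stored column.
diff = [a - b for a, b in zip(my_table["col1"], my_table["col2"])]
sorted_table = sort_by(my_table, diff)

# Aggregate col1 by an external array that is not part of the table.
some_array_with_repetition = ["x", "y", "x", "y"]
totals = aggregate_by(my_table["col1"], some_array_with_repetition, sum)
```

The only requirement on the external array is the one stated in the comment: it must have the same length as the table.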

dmix 5 years ago

This is a great programming language homepage, content-wise. The copy and examples are good and get right to the point.

But the navigation needs some work. The problem is the top logo isn't clickable, so I can't go back to the homepage from any of the subpages without clicking the back button, and the subpages don't have the primary navigation at the top either.

Edit: also the navigation should be repeated in the footer.

chrisaycock 5 years ago

Thanks for the tips. Web design is a complete mystery to me, so I'll gladly take all the feedback I can get.

jaupe 5 years ago

I really like that it feels like a dynamically typed language but with the security of type inference. That's really cool

chrisaycock 5 years ago

That's exactly the feel I've been going for. Instead of "gradual typing", I wondered if there was a way to make everything statically typed but still read from a file. I settled on a combination of type providers (F#) and compile-time function evaluation (D).
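
As a loose illustration of the type-provider idea (reading a file ahead of time to fix each column's type), here is a minimal Python sketch. It does not reflect how Empirical's compiler works. The function names, the sample data, and the type names other than Int64 are assumptions made for the example.

```python
import csv
import io
from datetime import datetime

def infer_type(values):
    """Pick the first type, in order, that parses every value in a column."""
    for name, parse in (
        ("Int64", int),
        ("Float64", float),
        ("Timestamp", datetime.fromisoformat),
    ):
        try:
            for v in values:
                parse(v)
            return name
        except ValueError:
            continue
    return "String"

def infer_schema(text):
    """Map each CSV header name to an inferred column type."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)
    return {name: infer_type(col) for name, col in zip(header, columns)}

# Hypothetical sample file contents.
sample = """symbol,timestamp,price,volume
AAPL,2019-05-01 09:30:00,210.52,100
AAPL,2019-05-01 09:30:01,210.50,250
"""
schema = infer_schema(sample)
```

In a statically typed setting, a schema like this would be computed once, before the rest of the program is type-checked, so that later mistakes surface as compile errors rather than four hours into a backtest.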

mamcx 5 years ago

Pretty cool. And is similar to my idea for a relational language:

https://bitbucket.org/tablam/tablam/wiki/Syntax

Only, this has shipped!

cedricd 5 years ago

This looks super interesting. We're doing a startup right now that transforms data into time series tables. Building datasets from those in sql has been challenging enough that we built out a UI to do it. This could be another elegant approach.

chrisaycock 5 years ago

Narrator looks interesting. I'm only reading CSV files for now, but I do want to handle SQL pushdown at some point in the future. Feel free to ping me if you want to swap war stories.
christopher.aycock (AT) empirical-soft.com

e12e 5 years ago

Looks very interesting. Is there (currently) any facility for saving work? Like writing dataframes and/or functions to disk? I had a quick look at the tutorial and source - but only found the stuff handling csv input.

chrisaycock 5 years ago
It's pretty rudimentary, but you can save CSV files from a Dataframe with:
```
    store(df, "some_file.csv")
```
I don't have modules yet, but you can load an Empirical code file from the REPL with a "magic command":
```
    >>> \l my_functions.emp
```
The full list of magic commands is available with:
```
    >>> \help
```

victorNicollet 5 years ago

I really like the `asof` keyword, so I'll be stealing it for my own language :-)

I suppose that `from ..` does not print the result, but rather returns a new dataframe that just happens to be printed by the REPL?

chrisaycock 5 years ago

You are correct about "from". Empirical is a normal programming language; when an expression is evaluated in the REPL and the result isn't stored, then the result is printed to the screen.
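
For readers who have not met the asof join praised above: the usual meaning, as in q's `aj` or pandas' `merge_asof`, is to match each row on the left with the most recent row on the right at or before its timestamp. The Python sketch below shows that backward-looking variant; it assumes Empirical's `asof` behaves similarly by default, which the thread does not spell out, and the function and variable names are hypothetical.

```python
from bisect import bisect_right

def asof_join(left_times, right_times, right_values):
    """For each left time, take the most recent right value at or before it.

    Both time lists must be sorted ascending. Returns None where no
    earlier right row exists.
    """
    result = []
    for t in left_times:
        # Number of right rows with a timestamp <= t.
        i = bisect_right(right_times, t)
        result.append(right_values[i - 1] if i > 0 else None)
    return result

# Hypothetical trades and quotes, with integer timestamps for brevity.
trade_times = [3, 7, 10]
quote_times = [1, 5, 7, 12]
quote_bids = [100.0, 101.0, 102.0, 103.0]
matched = asof_join(trade_times, quote_times, quote_bids)
```

Each trade picks up the last quote that was known at the time of the trade, which is the typical use in backtesting.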

floki999 5 years ago

Very nice. I realize that it currently runs in a shell, but one ingredient I would absolutely want is built-in charting capabilities - especially when running back-tests etc.

chrisaycock 5 years ago

Visualization is definitely important and I would love to hear anybody's thoughts on it. I think I should make a wrapper to an existing library, like matplotlib or ggplot2.
- X6S1x6Okd1st 5 years ago
  
  I've really been enjoying vegalite, but that does require a browser.

pnichols 5 years ago

Any comments on performance today or in the near future? Any features which should provide a big speedup in the future as compared to competitors (kdb, pandas)?

chrisaycock 5 years ago

I've primarily been focused on the ergonomics of the language, so I've only tried to make performance "reasonable" for now.
Longer-term performance objectives are:
1. JIT - I designed the VM's byte code to be both interpretable and a mid-level IR for LLVM. Currently I just interpret everything since there is almost no runtime overhead for vector operations. However, compiled code will greatly speed up any scalars in a loop.
2. SIMD - Since the VM's opcodes are already statically typed and vector-aware, integrating OpenBLAS and SLEEF (or Intel's MKL and VML) should be straightforward.
3. MIMD - Ideally I can just lean on existing libraries, though I'm not above embedding OpenMP if that gets the job done.
4. Distributed - Now comes the hard part. If we want MPI-level performance, I need to have more sophisticated scheduling. Which leads us to...
5. Streaming - This is the real holy grail. There has been a ton of research in the database community to get away from the "Volcano model" (iterators). I want to have the compiler generate streaming-aware opcodes for the VM based on the nature of how the data is to be consumed. I believe this will require a type system that can track the "context" of the computation, similar to how Koka and F* track side effects. I'm not aware of any general-purpose language that has compiled streaming.
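
For context on item 5, the "Volcano model" is the classic pull-based design in which every query operator exposes a `next()` call and fetches one row at a time from its child. The Python sketch below illustrates that iterator style only; it is not Empirical's VM, and the class names and sample rows are made up for the example.

```python
class Scan:
    """Leaf operator: yields stored rows one at a time."""
    def __init__(self, rows):
        self.rows = iter(rows)

    def next(self):
        return next(self.rows, None)  # None signals end of input

class Filter:
    """Pulls from its child until a row satisfies the predicate."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row

class Project:
    """Transforms each row pulled from its child."""
    def __init__(self, child, func):
        self.child = child
        self.func = func

    def next(self):
        row = self.child.next()
        return None if row is None else self.func(row)

def drain(op):
    """Pull rows from the root operator until it is exhausted."""
    out = []
    row = op.next()
    while row is not None:
        out.append(row)
        row = op.next()
    return out

rows = [
    {"symbol": "A", "price": 99},
    {"symbol": "B", "price": 101},
    {"symbol": "C", "price": 150},
]
plan = Project(Filter(Scan(rows), lambda r: r["price"] > 100),
               lambda r: r["symbol"])
result = drain(plan)
```

Every row costs one `next()` call per operator in the plan, which is the per-row overhead that vectorized and compiled streaming approaches try to avoid.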
- corysama 5 years ago
  
  Looking at interpret.cpp for SIMD potential: I bet you could add an allocator for std::vector that aligns and pads everything to 32 bytes then just replace all of the scalar op loops with loops over AVX intrinsics. No need for an external library.
  
  chrisaycock 5 years ago
  
  That's a possibility to get something running near term. I'm trying to avoid CPU-specific intrinsics since I have a fantasy that this might be run on ARM in the future, though that may be getting really ahead of myself.
  
  corysama 5 years ago
  
  NEON intrinsics are pretty easy as well ;) As long as you are doing simple +-*&| ops they work the same as SSE.

atemerev 5 years ago

Thank you! Aside from non-free kdb, there are not many products available in the field. I will test it and see if it works for my tasks.

chrisaycock 5 years ago

Thanks for trying it. As I mention elsewhere, Empirical is pretty limited right now because this is the first beta release. If you run into something specific that's missing, please let me know about it, ideally on the issue tracker:
https://github.com/empirical-soft/empirical-lang/issues
That way I'll know what targets to hit.

mvcalder 5 years ago

My corporate overlords are blocking access to your site. They say your cert is invalid.

chrisaycock 5 years ago

Are you able to post any error messages? I haven't gotten notice from anyone else, and my own browser doesn't have any complaints. But if there's a problem, I want to get it fixed.
- mvcalder 5 years ago
  
  I sent them your reply and magically all is good. Thanks for the "ammo" and the work.

chubot 5 years ago

This looks cool! A few comments:

- I peeked at the VVM implementation, since I've been looking for an implementation of data frames for my shell Oil [1]. I've looked at Hadley Wickham's dplyr code, R's data.table library, Pandas (which is somewhat awkwardly based on NumPy), and a little bit at Apache Arrow. I also remember R has a "zoo" library though I haven't used it much.

Was your design influenced by any system in particular? I've had a hard time finding any descriptions of data frames other than the code. I have less experience with time series, but I believe the main issue on top of data frames is having joins by time columns (e.g. your "asof" operator).

But otherwise, could VVM be used for dplyr-style analysis? dplyr has a very rich set of operators.

https://www.rstudio.com/wp-content/uploads/2015/02/data-wran...

Hadley does a good job of describing the high level philosophy, but I've been looking for low level advice, like how to do vectorized math quickly with overflow checks and so forth. (Do you have ints or is everything a float?) Maybe it's not a big deal, but it's not something I have experience with. I'd like to read about someone's implementation, especially in portable C / C++. I think a lot of earlier systems were in Fortran/assembly.

I guess your implementation is fairly different because the language is statically typed. I peeked and it looks like DataFrame is `std::vector<void*>`, which makes sense for static typing.

Have you looked at how Julia does things? It has macros and fast code generation. I imagine you started this project before Julia 1.0, where they added NA for data frame support. I'm more of an R and Python user, but I find Julia pretty interesting, e.g. something like this approach is impossible in R or Python AFAIK:

http://scattered-thoughts.net/blog/2016/10/11/a-practical-re...

http://scattered-thoughts.net/blog/2018/08/16/julia-as-a-pla...

- I watched your video, which is a nice demo. My feedback: if you want to maximize the number of people that get through it, I would make the font bigger and also raise the bottom of the window so the typed code is more readable. The code is nearly clipped off which causes some friction for viewers. Hope that's helpful.

- Nice to see someone else using Zephyr ASDL! I have linked these blog posts a few times here: http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL

Anyway I hope to have time to play with this a bit more. I don't have that many time series use cases but I'm definitely interested in data frames!

[1] The slogan for why a shell could use data frames is: "the output of ls and ps is a table". For those unfamiliar with data frames, here's my intro: What Is a Data Frame? (In Python, R, and SQL) http://www.oilshell.org/blog/2018/11/30.html

chrisaycock 5 years ago

VVM is column-oriented, which is how pretty much every Dataframe implementation works. Each column is a vector of whatever the user's type represents; Int64 in Empirical is i64 in VVM and int64_t in C++.
VVM has its own statically typed assembly language. You can see examples of it in the regression tests; here's one that sorts a table:
https://github.com/empirical-soft/empirical-lang/blob/master...
Since it's a virtual machine, VVM is pretty low level and really only meant as a compilation target. While it does some of the heavy lifting to match keys or determine the order of indices in a vector, Empirical is needed to coordinate the moving pieces.
Empirical takes a very different approach from Julia. Empirical is statically typed (not "gradually" typed), is focused around Dataframes, and compiles to a VM that is then interpreted.
As I mention elsewhere, I haven't done much on the performance side of things. I eventually want SIMD and JIT, but my priority for now is getting the Empirical language right.
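
To picture the column-oriented layout described above, here is a toy Python sketch: each column is one contiguous, typed array, and sorting a table means computing an index order from one column and applying it to all of them. This is an illustration under assumptions, not VVM's code. The `Dataframe` class, `sort_order`, and the sample data are hypothetical; Python's `array` typecode `"q"` is used as a stand-in for a 64-bit integer column and `"d"` for a double column.

```python
from array import array

class Dataframe:
    """Toy column store: each column is one contiguous, typed array."""
    def __init__(self, **columns):
        lengths = {len(c) for c in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def take(self, indices):
        """Build a new Dataframe with rows picked in the given order."""
        indices = list(indices)
        return Dataframe(**{
            name: array(col.typecode, (col[i] for i in indices))
            for name, col in self.columns.items()
        })

def sort_order(column):
    """Indices that would sort the column ascending."""
    return sorted(range(len(column)), key=column.__getitem__)

df = Dataframe(
    volume=array("q", [300, 100, 200]),   # integer column
    price=array("d", [10.5, 11.0, 9.75]),  # floating-point column
)
sorted_df = df.take(sort_order(df.columns["volume"]))
```

Because every column carries a single element type, an operation over a column never has to inspect the type of individual values.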