pavanky 7 years ago

I'd also like to shamelessly point out something I work on: https://github.com/arrayfire/arrayfire-python

It is a python wrapper around https://github.com/arrayfire/arrayfire and lets the code you write run on CUDA, OpenCL, or x86 backends.
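
For a taste, here's a minimal sketch (backend names as in the arrayfire-python docs; 'cpu' covers the x86 path):

    import arrayfire as af

    # Pick the backend at runtime; which ones are available depends on
    # how arrayfire was built/installed.
    af.set_backend('cuda')        # or 'opencl' or 'cpu'

    a = af.randu(1000, 1000)      # random matrix on the active device
    b = af.matmul(a, a)           # executes on whichever backend is set
    print(af.sum(b))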

  • snthpy 7 years ago

    Sounds great! However, without having looked at the API, my first reaction is that it's yet another interface to learn. What would be awesome is if I could just take my existing numpy or theano code and drop in an arrayfire object. Is that possible, or how similar is the API?

    • pavanky 7 years ago

      There's an effort to provide a drop-in replacement for numpy using arrayfire: https://github.com/FilipeMaia/afnumpy

      It is still a work in progress and requires some upstream changes in arrayfire to support the numpy api better.
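
      The idea is that you mostly just swap the import; something like this (an untested sketch of the intended usage):

          import afnumpy as np  # instead of: import numpy as np

          x = np.arange(10, dtype=np.float32)
          y = np.sin(x) + 1.0   # runs through arrayfire under the hood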

  • QuadmasterXLII 7 years ago

    I've used this; it is surprisingly intuitive. Thanks for making it!

lhenault 7 years ago

How does this compare to CuPy (https://cupy.chainer.org/)? It is now independent of Chainer, is highly compatible with numpy, and supports both CUDA and cuDNN.

  • Loic 7 years ago

    For the CUDA part I can't tell, but Numba also compiles your Python code on the fly into machine code using LLVM. This is where it shines. For example, instead of pushing your code into Cython or a Fortran library, you can keep writing simple Python and, in some cases, get your code to run nearly as fast as Fortran. This is my use case. I haven't used the CUDA features yet.
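
    For a taste, a minimal sketch of the kind of loop I mean (the function and numbers are made up):

        from numba import jit

        @jit(nopython=True)
        def pairwise_dist_sum(x):
            # A plain-Python nested loop; Numba compiles it to machine
            # code via LLVM on the first call.
            total = 0.0
            n = x.shape[0]
            for i in range(n):
                for j in range(i + 1, n):
                    d = x[i] - x[j]
                    total += d * d
            return total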

    • fnl 7 years ago

      But LLVM doesn't support vectorization, like AVX or SSE4, right? So I don't think that would be nearly as fast as fully (Intel-)CPU-optimized code...

      EDIT: Let me hedge that a bit and limit it to advanced AVX instructions, as LLVM can handle simple loops and such.

      • radarsat1 7 years ago

        Your comment surprised me, as clang is regarded as pretty competitive these days (compared to gcc). I don't know the current state of things, but a quick search revealed at least one sentence at http://llvm.org/docs/Vectorizers.html: "the loop below will be vectorized on Intel x86 if the SSE4.1 roundps instruction is available." So it seems SSE4 is supported by LLVM?

        I was going to say maybe it's a new thing, but the following post also talks about SSE4 and is from 2011: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html

        Maybe it only supports a subset of SSE4? Do you know the details, compared to other compilers?

        • fnl 7 years ago

          The source of my confusion is that the last time I looked into LLVM's SIMD support was in the context of looking at Rust, a bit more than a year ago, and back then my conclusion was that neither (Rust nor LLVM, then at version 3) was a very good tool for that. It seems I was very wrong, at least on the LLVM side.

          EDIT: Sorry, to reply to your question: my concern is not GCC vs. clang. If you want to max out your vector ops, I would suggest comparing against ICC as the "standard", at least on Intel CPUs.

          • radarsat1 7 years ago

            Yep, I understand about icc. I actually wonder what is so difficult about optimising the way icc does: what does it actually do so much better than gcc? Anyone know of a good analysis?

            • fnl 7 years ago

              I'd mostly attribute that to MKL and to the fact that they only have to deal with their own instruction sets. But that's just an "educated guess".

              • mattnewton 7 years ago

                Probably also having full-time people, working in the same company as the hardware guys, whose job is to make the hardware look good.

                • radarsat1 7 years ago

                  I understand the social reasons, but I was wondering more about what the compiler does in terms of technical achievements that go beyond what gcc/clang can successfully generate. Surely this is something that can be studied empirically.

      • Coding_Cat 7 years ago

        I'm not sure which parts are clang and which are LLVM, but I recently did some tests comparing g++ to clang++ for auto-vectorization and they were very much on par. I'd even say clang was, on average, a little better than g++ with full optimizations turned on.

        This was on an AVX2 machine, testing the (auto-)vectorization performance of expression templates. Anything with a compile-time-unknown stride or a random gather failed horribly with both. Using (semi-)explicit vectorization turned out to be much faster still.

        • fnl 7 years ago

          Impressive; I hadn't taken a look since the 3.x series, so I am totally stunned by the amount of "love" LLVM has received lately (version 4 was released just a few months ago, and version 6 is still under development).

      • keldaris 7 years ago

        While your statement can certainly be correct for a sufficiently stringent definition of "advanced" (and replacing "instructions" with "code patterns", etc.), in my experience using clang for C++, Julia (a language largely reliant on the LLVM optimizer), and LDC (a D compiler built on top of LLVM), vectorization support is competitive with (and sometimes superior to) GCC. Comparisons with ICC are more complicated: ICC is unparalleled for specific patterns and often fails horribly at other things; the variance there is quite large.

        Disclaimer: by "vectorization" I'm referring to SSE4, AVX and AVX2, I haven't had a chance to try out AVX512 yet.

      • lliiffee 7 years ago

        I believe that LLVM 6 has finally introduced this, e.g. see http://llvm.org/docs/Vectorizers.html#vectorization-of-funct...

        • Joky 7 years ago

          Uh, what you're pointing at was introduced in 2012 in LLVM.

          • fnl 7 years ago

            Only in parts, not all instructions, and some functionality it did have was buggy. Versions 4 and 5 are much more advanced/competitive on SIMD issues, it seems.

            Edit: Oh, sorry you meant that other guy's link to LLVM's vectorization tutorial. Ignore my reply ...

        • fnl 7 years ago

          Oh, cool, I see LLVM now even sports (basically all of) SSE4.2 and AVX-512. As always, that project amazes... :-)

  • dagw 7 years ago

    One advantage of Numba is that it doesn't require CUDA. You can easily write your code so that if the machine it's running on has CUDA it will use that, and if it doesn't it will just JIT for the CPU.
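
    Something along these lines (a rough sketch; the dispatch logic and kernel are illustrative, but cuda.is_available() is a real Numba call):

        from numba import cuda, jit

        @jit(nopython=True)
        def saxpy_cpu(a, x, y, out):
            for i in range(x.shape[0]):
                out[i] = a * x[i] + y[i]

        @cuda.jit
        def saxpy_gpu(a, x, y, out):
            i = cuda.grid(1)
            if i < x.shape[0]:
                out[i] = a * x[i] + y[i]

        def saxpy(a, x, y, out):
            # Use the GPU kernel when CUDA is present, else the CPU JIT.
            if cuda.is_available():
                threads = 256
                blocks = (x.shape[0] + threads - 1) // threads
                saxpy_gpu[blocks, threads](a, x, y, out)
            else:
                saxpy_cpu(a, x, y, out)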

TheAlchemist 7 years ago

This looks really good; however, I struggle to find real applications for it.

For almost all practical applications, I use pandas or keras / tensorflow. I'm probably biased, as I mostly work with simple data that doesn't require complicated calculations.

Would somebody have some benchmarks against pandas for some standard operations?

  • wesm 7 years ago

    > Would somebody have some benchmarks against pandas for some standard operations?

    pandas creator here. Numba is a complementary technology to pandas, so you can and should use them together. It is designed for use with NumPy arrays and so does not deal with missing data and other things that pandas does. It does not help as much with non-numeric data types.

    • j88439h84 7 years ago

      Are you saying that if you have missing data you can't use numba, or that if you have missing data and use numba together with pandas, pandas will handle the missing data where numba alone could not?

      • grej 7 years ago

        Heavy numba user here. What Wes is saying is that while pandas handles some of those missing values in an automated way, numba works on numpy arrays, so you may have to handle some of those things yourself. I have at times used a separate numpy array to indicate whether values are missing or not. You could also use a value that is far out of the bounds of anything you might ever see in your real data, then test for that while you're looping over the values (e.g. fill missing values with -3.4E38 if you have a float32).

        Depending on what you're doing, you might be able to use numpy.nan as a value. It does work inside numpy arrays, but some methods that operate on those objects might not work as you expect.

        For instance, if you run numpy.mean on a numpy array of [nan, 4, 5], it will return nan. If you run the same thing on a pandas dataframe of the same values, you'll get 4.5.
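
        Concretely (numpy does ship NaN-aware variants like numpy.nanmean; you just have to ask for them explicitly):

            import numpy as np
            import pandas as pd

            vals = np.array([np.nan, 4.0, 5.0])
            print(np.mean(vals))           # nan  -- NaN propagates
            print(np.nanmean(vals))        # 4.5  -- NaN-aware variant
            print(pd.Series(vals).mean())  # 4.5  -- pandas skips NaN by default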

    • t8ge55geu8ygt 7 years ago

      Are you saying that if you have missing data you can't use numba, or that pandas will handle that part which numba couldn't otherwise do alone?

  • kronos29296 7 years ago

    Jitted code is nearly as fast as Numpy, but never faster, and the first run takes longer since the JIT needs to generate LLVM code the first time. If your calculation doesn't use unsupported features like classes (they weren't supported when I last checked, about a year ago) and is better written as a loop than as vectorized code, numba can speed it up.

    I believe Scipyconf 2016 had a talk on numba that goes into it in great detail. Just search for it on Youtube.

    Anything that isn't convenient to write as numpy arrays can be written using numba. It also works with pure python code, so your prototype can be used at scale with nothing but a decorator.

    • jzwinck 7 years ago

      Numba can be much faster than NumPy for some calculations. For example, if you want to compute both the min and the max of an array, NumPy requires two passes, but in Numba it is easily done in one. This can give close to a 2x speedup for arrays which do not fit in cache.

      • taeric 7 years ago

        That sounds suspiciously like something that could have easily been fixed in NumPy.

        • dr_zoidberg 7 years ago

          What he refers to is that you have to ask for them explicitly:

              import numpy as np
              arr = np.array([1, 3, 7, 5, 4, 3, 1, 0.1])
              maxval = arr.max()  # 1st pass
              minval = arr.min()  # 2nd pass
          
          Whereas with numba you'd have something like this:

              from numba import jit
              @jit
              def maxmin(arr):
                  maxval = minval = arr[0]  # seed both extrema from the first element
                  for e in arr:
                      if e < minval: minval = e
                      if e > maxval: maxval = e
                  return minval, maxval
          
          And that will get optimized to numpy-like speeds, but with a single pass over the data. So for large arrays you'll get about a 2x speedup, since memory access is the bottleneck.

          As for optimizing this use case in NumPy itself, I'd go for a cythonized maxmin() function, which is pretty much the same thing numba does, except you move the compilation overhead from the JIT to the compile step of the module.

          • taeric 7 years ago

            This was what I assumed from the message. It doesn't change my point: if this is truly a bottleneck, then it should be baked into numpy. In particular, asking the array to calculate min/max/p50/p75/... in one pass is something that would make sense.

            Regarding moving the compilation: yeah, I get that. But I'd argue the compile step of the module happens once per module, no? The JIT is something you force onto every execution, right?

            And none of this actually means this library shouldn't have been made. Just that it is a poor example for why it is better.

            • dr_zoidberg 7 years ago

              For some reason I can't reply to nerdponx, so here goes:

              I'd add some "intelligence" (if you will) to the class, so that when I calculate max(), I also track the min, average, etc., save those values in an internal cache, and return the max. On the next call (whether it's max, min, avg, or any other of the cached values), the result comes from the precomputed cache.

              Of course, some operations will affect those results, and there are two ways to handle that: either modify the cached results, if we're talking about a change that can be expressed as a formula (say, multiplying by 2 multiplies every computed value by 2), or simply invalidate the cache and re-compute the next time one of these functions is called.

              As was said, this would mean calculating things beyond what was explicitly asked for (max, min, avg, etc.) and accepting a worse worst case, in the hope that it results in a speedup for some use cases.

              My guess as to why they don't do it? Because it's not a "generalizable" problem, and you have the machinery at hand to implement a custom function that fits your particular use case (say, using Cython and the numpy/cython wrappers).
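
              A sketch of what I mean (a hypothetical wrapper, not a real numpy API):

                  import numpy as np

                  class StatsArray:
                      # Hypothetical wrapper: computes min/max/mean in one
                      # pass and caches them until the data is mutated.
                      def __init__(self, data):
                          self._data = np.asarray(data, dtype=np.float64)
                          self._cache = None

                      def _compute(self):
                          if self._cache is None:
                              lo = hi = self._data[0]
                              total = 0.0
                              for e in self._data:      # single pass
                                  if e < lo: lo = e
                                  if e > hi: hi = e
                                  total += e
                              self._cache = (lo, hi, total / self._data.size)
                          return self._cache

                      def min(self):  return self._compute()[0]
                      def max(self):  return self._compute()[1]
                      def mean(self): return self._compute()[2]

                      def scale(self, k):
                          # Mutation: either rewrite the cache or drop it.
                          self._data *= k
                          self._cache = None        # simplest: invalidate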

            • nerdponx 7 years ago

              You can't bake this into NumPy without a compiler or JIT. Python calls don't "know" about each other. The only way to do it would be to have a function that returns both the max and the min.

              • taeric 7 years ago

                Like the sibling said, you could bake the stats into member fields to cache the values. I'm guessing it wouldn't cause any real slowdown, but I can understand the concern. It would also help with repeated calls for the same stat.

                Though I would also think a general .stats method would make the most sense. It is quite common nowadays to want a lot of stats.

                And again, I am not arguing that the new lib shouldn't exist. I just question that example.

              • dr_zoidberg 7 years ago

                See my comment on the parent; for some reason (I think your comment was too new) I couldn't reply to you directly. As for the compiler: since Microsoft shipped MSVC for Python (2.7), or if you were using Python 3 to begin with, I haven't had any problems.

  • shoyer 7 years ago

    Another pandas developer here.

    Numba can give you similar performance to what you'll see for highly optimized operations in pandas (e.g., for groupby-sum or a moving average), but you would have to write the low-level loops yourself. Like Wes writes, it's a complementary technology: in principle, the low-level loops in pandas currently written in Cython could be ported to Numba instead.
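
    For instance, a moving average as a hand-written Numba loop might look like this (a sketch, not the pandas implementation; edge handling is simplistic):

        import numpy as np
        from numba import jit

        @jit(nopython=True)
        def moving_average(x, window):
            out = np.empty(x.shape[0])
            for i in range(min(window - 1, x.shape[0])):
                out[i] = np.nan          # not enough history yet
            acc = 0.0
            for i in range(x.shape[0]):
                acc += x[i]              # running sum over the window
                if i >= window:
                    acc -= x[i - window]
                if i >= window - 1:
                    out[i] = acc / window
            return out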

    I had a little project where I experimented with this a few years ago: https://github.com/shoyer/numbagg. Since then, I expect numba performance has only improved.

    Note that this is unlikely to ever happen in pandas itself, for various reasons. The existing routines in Cython already work (and unlike Numba, have a good story for distribution), and the algorithmic core for next version of pandas is being written in C++.

  • Aeolus98 7 years ago

    A while ago I had to do a complex ML task.

    It involved tons of time series data that followed a state machine, with very little training data.

    A useful algorithm to force a series of noisy predictions to follow a state machine is the Viterbi decoder.

    Numba let me write a JITted version that got order-of-magnitude improvements, especially when there were over 10^8 time series points.

    It's a great piece of software, if a bit finicky sometimes.

    • mtw 7 years ago

      can you elaborate on the finicky part?

      • glup 7 years ago

        I've noticed two pain points: Installing outside of Anaconda can be a real chore, and error messages were extremely unhelpful (as of about 12-18 months ago, hopefully it's better now).

        • dr_zoidberg 7 years ago

          I work in non-Anaconda environments, and this single pain point has kept me away from it. I write some borderline code where I need both the scientific stack and Django/flask/"webby libraries", so I could never go "full Anaconda" on the stack.

  • ballooney 7 years ago

    This example doesn't do the entire numba project justice, but if you've ever written a for-loop in a bit of python code that does number crunching, you'll have noticed how much it slows everything down. The numba jit provides a decorator that yields an extremely quick win, often 1-2 orders of magnitude of improvement in calculation time. It's less of a win if you're only ever working with already-vectorised data structures and algorithms.

    I was confused by your comment though, specifically the idea that you are using tensorflow because you 'mostly work with simple data that doesn't require complicated calculations'. This seems very contradictory. Have I misunderstood?

    • TheAlchemist 7 years ago

      A bit unclear indeed. By that I meant that most of the stuff I do either fits well in pandas dataframes and requires mostly standard operations already well implemented in pandas (actually, there are so many of those that by 'standard' I mean practically all the operations I need), or it's image data, which Keras handles very handily.

  • 451mov 7 years ago

    definitely a lot faster than pandas!

    question: how do you debug a function which has a super complex decorator slapped on top of it?

    • dagw 7 years ago

      > question: how do you debug a function which has a super complex decorator slapped on top of it?

      With difficulty. If you find yourself in a situation where you have a function that works without the numba decorator but fails with it, then it's time to break out your llvm reading skills: http://numba.pydata.org/numba-doc/0.10/annotate.html

      That being said, it is very rare that that happens.
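
      A couple of tricks that help (py_func and inspect_types are attributes of numba's jitted dispatcher objects):

          from numba import jit

          @jit(nopython=True)
          def f(x):
              return x * 2

          f.py_func(21)        # call the original, undecorated Python function
          f(21)                # trigger compilation
          f.inspect_types()    # print the source annotated with inferred types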

      • marmaduke 7 years ago

        You can just remove the decorator?

        • glup 7 years ago

          No, because certain things that are fine in an interpreted function will fail to compile, and throw totally uninformative errors. E.g., one time I spent about 3 hours trying to figure out why a relatively simple, perfectly functional numpy function, limited to the subset of Numba primitives, wouldn't compile. A labmate with actual knowledge of C looked at the code and in about 30 seconds suggested the problem might be its two return statements; she was of course correct. Doh.

          • marmaduke 7 years ago

            Ok but generally it is possible to debug by omitting the decorator.

sandGorgon 7 years ago

oh wow! does anyone know how this compares to Julia CUDA performance?

  • wallnuss 7 years ago

    Given that CUDAnative.jl beats CUDA C in some benchmarks and is a bit slower in others, I would suspect that Numba is similar in performance.

    Where the Julia CUDA support really shines is that it supports arbitrary Julia structs, not just a blessed few datatypes like Float32.

  • code5fun 7 years ago

    not even close baby ( ͡° ͜ʖ ͡°)

    • littlepeter2 7 years ago

      So who is slower, numba or julia with cuda? And how do you know it is not even close?

m3kw9 7 years ago

So what happens if it runs in something like a Jupyter notebook mixed with runtime code?

anc84 7 years ago

(2013)

  • grej 7 years ago

    See the note at the top of the article:

    Note, this post was originally published September 19, 2013. It was updated on September 19, 2017.

    • p1esk 7 years ago

      So, what exactly was updated?

      • bsprings 7 years ago

        When I originally wrote the post in 2013, the GPU compilation part of Numba was a product (from Anaconda Inc., née Continuum Analytics) called NumbaPro. It was part of a commercial package called Anaconda Accelerate that also included wrappers for CUDA libraries like cuBLAS, as well as MKL acceleration on the CPU.

        Continuum gradually open sourced all of it (and changed their name to Anaconda). The compiler functionality is all open source within Numba. Most recently they released the CUDA library wrappers in a new open source package called pyculib.

        Some other minor things changed, such as what you need to import. Also, the autojit and cudajit functionality is a bit better at type inference, so you don't have to annotate all the types to get it to compile.

        We thought it was a good idea to update the post in light of all the changes.

moon_of_moon 7 years ago

Real nice, NVidia. Now how about some support for Linux on Optimus/hybrid GPU laptops?

  • jhasse 7 years ago

    Stop buying them ;)