karteum 8 hours ago

There are aspects I didn't fully understand; maybe someone can explain. On one hand, I understand that object arrays were very slow and raised issues, that PyArrow strings solved the issue (how?) at the expense of a huge dependency, and that the approach here also solves the issue in a lighter way. But it seems to work somewhat the same way as object arrays: "So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data. This is similar to how object arrays work, except instead of pointers to Python objects, they’re pointers to strings" => OK, so if they are stored in a similar way, how is that more performant or better than regular object arrays?
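
For concreteness, a minimal sketch of the two layouts in use (assuming NumPy >= 2.0, where np.dtypes.StringDType and the np.strings ufuncs exist; the performance claims are the article's, not demonstrated here):

    import numpy as np

    # Both arrays hold variable-length strings via indirection, but
    # dtype=object stores pointers to full PyObjects, while StringDType
    # stores compact references that the dtype machinery manages itself.
    obj_arr = np.array(["short", "a much longer string"], dtype=object)
    str_arr = np.array(["short", "a much longer string"],
                       dtype=np.dtypes.StringDType())

    # String ufuncs can operate on StringDType without boxing each
    # element back into a PyObject on every access:
    print(np.strings.upper(str_arr))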

  • hzably an hour ago

    I do not understand it either. The concept (a pointer to a C string plus a length) is obvious and has been done many times. Avoiding per-string PyObject allocation saves memory and will be a bit faster than full PyObjects. You can still use the CPython arena allocator, which is faster than malloc() for small objects. This is also standard practice. (A toy sketch of the pointer-plus-length idea is at the end of this comment.)

    Other than that, why is this "for the next generation of scientific computing"?

    The announcements that come out of NumPy continue to amaze me.
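
    A toy model of that pointer-plus-length concept via ctypes (the field names are made up for illustration, not NumPy's actual internal struct):

        import ctypes

        class StaticString(ctypes.Structure):
            # Illustrative layout: a byte length plus a pointer to the data.
            _fields_ = [
                ("size", ctypes.c_size_t),  # byte length of the payload
                ("buf", ctypes.c_char_p),   # pointer to the character data
            ]

        payload = b"hello"
        entry = StaticString(len(payload), payload)
        print(entry.size, entry.buf)  # -> 5 b'hello'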

  • DHRicoF 8 hours ago

    One way they address data sparsity is by using arenas: a big pool of memory is allocated and the strings live inside it, instead of "floating" around the heap. (A toy sketch follows at the end of this comment.)

    The other improvements are related to the representation and layout of the UTF-8 data.

    I've only read this briefly, so I don't know if there are other optimizations in the memory allocation or elsewhere.
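
    A toy sketch of that arena idea in Python (illustrative only; the real allocator is C code inside the dtype):

        class Arena:
            """Pack strings into one contiguous buffer; hand out (offset, length)."""

            def __init__(self):
                self.buf = bytearray()

            def store(self, s):
                data = s.encode("utf-8")
                offset = len(self.buf)
                self.buf += data
                return offset, len(data)  # compact reference into the arena

            def load(self, offset, length):
                return self.buf[offset:offset + length].decode("utf-8")

        arena = Arena()
        ref = arena.store("hello")
        print(arena.load(*ref))  # -> hello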

Neywiny 13 hours ago

That's an interesting write-up, and I hope the API change yields more useful data types. I'm not sure about the 28-byte size field, though. The way the rest of the code works, it seems like a typo and should be 8 bytes. I may dig into the code and find out.