karteum 8 hours ago

There are aspects I didn't fully understand; maybe someone can explain. On one hand, I understand that object arrays were very slow and raised issues, that PyArrow strings solved the issue (how?) at the expense of a huge dependency, and that the approach here also solves the issue in a lighter way. But it seems to work somewhat the same way as object arrays: "So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data. This is similar to how object arrays work, except instead of pointers to Python objects, they’re pointers to strings" => OK, so if they are stored in a similar way, how is that more performant or better than regular object arrays?
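
For concreteness, a minimal sketch of the two layouts in use (assuming NumPy >= 2.0, where np.dtypes.StringDType and the np.strings ufuncs exist; the performance claims are the article's, not demonstrated here):

    import numpy as np

    # Both arrays hold variable-length strings via indirection, but
    # dtype=object stores pointers to full PyObjects, while StringDType
    # stores compact references that the dtype machinery manages itself.
    obj_arr = np.array(["short", "a much longer string"], dtype=object)
    str_arr = np.array(["short", "a much longer string"],
                       dtype=np.dtypes.StringDType())

    # String ufuncs can operate on StringDType without boxing each
    # element back into a PyObject on every access:
    print(np.strings.upper(str_arr))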

  • hzably an hour ago

    I do not understand it either. The concept (a pointer to a C string plus a length) is obvious and has been done many times. Avoiding per-string PyObject allocation saves memory and will be a bit faster than full PyObjects. You can still use the CPython arena allocator, which is faster than malloc() for small objects. This is also standard practice. (A toy sketch of the pointer-plus-length idea is at the end of this comment.)

    Other than that, why is this "for the next generation of scientific computing"?

    The announcements that come out of NumPy continue to amaze me.
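
    A toy model of that pointer-plus-length concept via ctypes (the field names are made up for illustration, not NumPy's actual internal struct):

        import ctypes

        class StaticString(ctypes.Structure):
            # Illustrative layout: a byte length plus a pointer to the data.
            _fields_ = [
                ("size", ctypes.c_size_t),  # byte length of the payload
                ("buf", ctypes.c_char_p),   # pointer to the character data
            ]

        payload = b"hello"
        entry = StaticString(len(payload), payload)
        print(entry.size, entry.buf)  # -> 5 b'hello'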

  • DHRicoF 8 hours ago

    One way they address data sparsity is by using arenas: a big pool of memory is allocated and the strings live inside it, instead of "floating" around the heap. (A toy sketch follows at the end of this comment.)

    The other improvements are related to the representation and layout of the UTF-8 data.

    I've only read this briefly, so I don't know if there are other optimizations in the memory allocation or elsewhere.
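
    A toy sketch of that arena idea in Python (illustrative only; the real allocator is C code inside the dtype):

        class Arena:
            """Pack strings into one contiguous buffer; hand out (offset, length)."""

            def __init__(self):
                self.buf = bytearray()

            def store(self, s):
                data = s.encode("utf-8")
                offset = len(self.buf)
                self.buf += data
                return offset, len(data)  # compact reference into the arena

            def load(self, offset, length):
                return self.buf[offset:offset + length].decode("utf-8")

        arena = Arena()
        ref = arena.store("hello")
        print(arena.load(*ref))  # -> hello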

Neywiny 13 hours ago

That's an interesting write-up, and I hope the API change yields more useful data types. I'm not sure about the 28-byte size field, though. The way the rest of the code works, it seems like a typo and should be 8 bytes. I may dig into the code and find out.