dylan604 2 years ago

At the end of the day, I'm very thankful that other people are working/solving things like this so that I just write whatever high-level code I want to get something I need done now. I have lots of respect for this level of dev'ing effort so I can use the system as a tool like a person driving a car without knowing how internal combustion works. I consider it a version of standing on the shoulders of giants.

dragontamer 2 years ago

Seems a bit odd to stop at SWAR on today's systems.

Every system I'm aware of has 128-bit SIMD implemented: either SSE (x86), NEON (ARM), or AltiVec (PowerPC). As such, 128-bit SIMD is the "reliably portable" SIMD operation.

Of course, for fastest speeds, you need to go to the largest SIMD-register size for your platform: 512-bit for some Intel processors, rumored AMD Zen4 and Centaur chips. 256-bit for most Intel / AMD Zen3 chips. 128-bit for Apple ARMs, 512-bit for Fujitsu A64FX ARMs, etc. etc.

> And these are the fastest kinds of instructions :-)

And it should be noted that modern SIMD instructions execute at one-per-clock-tick. So the 64-bit instructions certainly are fast, but the SIMD instructions tie if you're using simple XOR or comparison operations.

--------

This style of code might even be "reliably auto-vectorizable" on GCC and Clang, actually. I wonder if I could get portable auto-vectorizing C from these examples.
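
A naive branchless byte loop is probably the thing to feed the compiler (my own sketch, untested; whether it actually vectorizes depends on compiler, flags, and target):

    #include <stddef.h>

    void tolower_ascii(unsigned char *s, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = s[i];
            /* set the 0x20 bit only for 'A'..'Z' */
            s[i] = c | (((c >= 'A') & (c <= 'Z')) << 5);
        }
    }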

  • dietrichepp 2 years ago

    Naive version to explore autovectorization:

    https://gcc.godbolt.org/z/bcefvznG3

    Yes, there's some amount of autovectorization here. Seems like a mess of a function. Here's something more like tolower8():

    https://gcc.godbolt.org/z/h8GTanodd

    The generated code definitely looks funny to me. Here is a manual vectorization, which is shorter:

    https://gcc.godbolt.org/z/1e4odEsKq

    The issue of loading into your vector register... well, there are some dirty tricks for that. The 16-byte slice containing the first byte, and the 16-byte slice containing the last byte, can both be loaded into registers and then shifted around in order to construct the desired value. Note the careful wording here... these slices might be the same slice. Or you can iterate over the 16-byte slices containing the array, and shift as you go, if you're storing into a different location. Or you can use various masked load/store operations on various architectures.
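
    For the in-place case, here is a rough sketch of the "iterate over the 16-byte slices" option (untested, SSE2; the partial write-back is done with a small memcpy rather than a masked store, with tolower as the running example):

        #include <emmintrin.h>
        #include <stdint.h>
        #include <stddef.h>
        #include <string.h>

        static void tolower_aligned_slices(unsigned char *p, size_t n)
        {
            unsigned char *end = p + n;
            unsigned char *blk = (unsigned char *)((uintptr_t)p & ~(uintptr_t)15);

            for (; blk < end; blk += 16) {
                /* aligned 16-byte load: never crosses a page boundary */
                __m128i v   = _mm_load_si128((const __m128i *)blk);
                __m128i geA = _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1));
                __m128i leZ = _mm_cmplt_epi8(v, _mm_set1_epi8('Z' + 1));
                __m128i low = _mm_or_si128(v, _mm_and_si128(_mm_and_si128(geA, leZ),
                                                            _mm_set1_epi8(0x20)));
                unsigned char tmp[16];
                _mm_storeu_si128((__m128i *)tmp, low);

                /* write back only the bytes that fall inside [p, p + n) */
                unsigned char *from = blk < p ? p : blk;
                unsigned char *to   = blk + 16 > end ? end : blk + 16;
                memcpy(from, tmp + (from - blk), (size_t)(to - from));
            }
        }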

    • nimish 2 years ago

      > Seems like a mess of a function

      It has to deal with an arbitrary n, so it's going to have some messy bits to deal with the "fringe": https://www.sigarch.org/simd-instructions-considered-harmful...

      • janwas 2 years ago

        It's often possible to avoid fringes via padding/overallocation. And if not, it may be possible to use unaligned loads/stores to handle the fringe in a single (final) iteration: https://github.com/google/highway#strip-mining-loops
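
        Strip-mining the fringe that way looks roughly like this (a sketch under the stated assumptions; tolower16() is a hypothetical 16-byte kernel, and the overlap is harmless because tolower is idempotent):

            #include <stddef.h>

            void tolower16(unsigned char *p);      /* hypothetical in-place 16-byte kernel */

            /* assumes n >= 16 */
            void tolower_strip_mined(unsigned char *p, size_t n)
            {
                for (size_t i = 0; i + 16 <= n; i += 16)
                    tolower16(p + i);
                if (n % 16 != 0)
                    tolower16(p + n - 16);         /* one final, overlapping iteration */
            }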

        It is actually feasible to write vector-style code using SIMD instructions. Yes, the SIMD ISA is more complicated because of the various accumulated extensions, but this is what we currently have. And a bit larger code size (for one final loop iteration) doesn't seem to be a big deal.

      • tialaramex 2 years ago

        This sort of stuff is where Iterate Loops are good:

        https://github.com/google/wuffs/blob/main/doc/note/iterate-l...

        WUFFS wants this because it demands all the checking at compile time (WUFFS code with a potential buffer overflow just won't compile), so if you need bounds checks you'll be writing them out by hand and the iterate loop often allows you to express a correct solution with no actual checks.

      • dietrichepp 2 years ago

It’s more of a comment on how messy that code gets at -O3… it looks a lot messier than the -O2 code.

(I think it’s time to say that “‘X considered harmful’ considered harmful”.)

        • dzaima 2 years ago

          -O2 doesn't have any usage of SIMD/vectorization though. Of course it'd be simpler.

          • dietrichepp 2 years ago

            ...Have you looked at the -O3 assembly that I'm talking about?

            There's a short SIMD loop, which is simple, and fairly easy to understand. At -O2, you get a short scalar loop, which is simple, and fairly easy to understand.

            If you were writing this function by hand, you might combine the SIMD loop with the scalar loop. That's not what happens, though. Instead, you get two versions of the SIMD loop which handle different alignments (unnecessary) and then, what appears to be an unrolled version of the scalar loop.

            What’s going on here is that there are a bunch of transformations enabled at -O3 that come along with autovectorization. You don’t just get autovectorization by itself, you also get transformation passes that are designed to put your code into a state where vectorization is easier. You also sometimes get a number of transformations which are (empirically) dubious, in the sense that the resulting code is both larger and slower.

            • dzaima 2 years ago

              There's only one SIMD loop - a 128-bit one. There's a 64-bit SIMD segment for handling when the tail has ≥8 items (which is gonna be a pretty decent speedup for 8≤n≤15 byte inputs). If you're gonna operate on only very large inputs, it's quite unnecessary, yeah, but for a lowercasing function small inputs will be very common.

              The unrolling of the scalar tail is quite unnecessary though (gcc does so even with -fno-unroll-loops). clang doesn't[0], though it does unroll the main loop (though that can be disabled with -fno-unroll-loops).

              [0]: https://gcc.godbolt.org/z/n85aavnEo

              • dietrichepp 2 years ago

                You're right, only two loops.

                It is still messy, though. And "it has SIMD" is not really a sufficient explanation for the messiness. I think of -O3 as a kind of "throw everything at the wall and see what sticks". You're pressing a button that makes your code big, messy, and fast (although not always big, not always messy, and not always fast).

                You can spend as much time as you like explaining the reasoning behind these code transformations, but in the end, there are very good reasons why people don't use -O3 as the default optimization level.

    • dragontamer 2 years ago

      > The issue of loading into your vector register... well, there are some dirty tricks for that.

      Good discussion. I think the only method you haven't talked about is the simple "unaligned load" instructions (which might be the simplest, and most portable way, to do this). I know that ARM and x86 both can do unaligned loads no problem, but at a possible performance penalty.

      • dietrichepp 2 years ago

        The reason I didn’t mention unaligned load is because it can cross a page boundary and fault. With aligned loads, you can avoid this.

        (Edit: The idea is that with an aligned load, you only load 16-byte blocks that contain valid data. This can’t page fault, but you can load data you didn’t want, which is ok. Writing is trickier because the write may have to be atomic and can’t spill.)

      • zen_1 2 years ago

        Another (admittedly quite niche) gotcha I've experienced with aarch64 unaligned SIMD loads specifically is that they will fail (raising a data abort exception) if they access memory that isn't specifically designated as "normal ram" by the MMU.

        This can happen if you're writing code for an embedded system, or even if you're calling memcpy with a too-smart-for-its-own-good libc implementation (such as Newlib, which ships with the aarch64-none-elf toolchain ARM provides...) without first setting the MMU up.

  • sedatk 2 years ago

    "labels are frequently less than 8 bytes long, and therefore fit inside a 64-bit register. So it probably isn’t worth dealing with the portability issues of working with wide vector registers (Especially since I could not find a quick way to load an arbitrary number of bytes into a vector register with AVX2 nor with NEON.)"

    • dragontamer 2 years ago

      DNS labels like "ycombina" (tor) you mean? :-)

      • fanf2 2 years ago

        Two out of three ain’t bad :-)

        news.ycombinator.com

      • cratermoon 2 years ago

        Just off the top of my head, I'm betting that 9 out of 10 of the most visited internet domains are 8 bytes or less: google, facebook, twitter, youtube, baidu, yahoo. I think instagram might be one of the few that isn't 8 bytes. That's not counting URL shorteners, either.

  • adrian_b 2 years ago

    Actually most modern SIMD instructions have a throughput of more than one per clock cycle.

    Simple SIMD instructions, like integer addition, comparison, XOR, have a throughput of 2 per cycle (many ARM cores), 3 per cycle (most Intel CPUs) or even 4 per cycle (AMD Zen 3).

    Similar simple 64-bit instructions have a higher throughput in instructions per cycle, from 3 per cycle up to 5 per cycle (Alder Lake), but when considering the register size the total throughput of the SIMD instructions is always higher.

  • nimish 2 years ago

    > Every system I'm aware of has 128-bit SIMD implemented

    These still come in handy on embedded systems where there isn't a vector unit

  • fanf2 2 years ago

    Is there a good way to load an arbitrary number of bytes (up to the vector size) into a vector register? As I said in the article, I could not find one when looking at AVX2 or NEON reference manuals. Getting the data from RAM is the main bottleneck for short strings, which DNS names usually are.

    • dragontamer 2 years ago

      You're not thinking in terms of SIMD.

      If you load 5 bytes into a 128-bit register (16-bytes) you'll have 5-bytes of data + 11-bytes of garbage.

      Perform the calculation over all 16-bytes. Yes, this makes 11-bytes of garbage at the end. Once you're done, just write back the first 5 bytes and you're set.

      The 11-bytes of garbage are "free". They didn't cost you anything.

      > Getting the data from RAM is the main bottleneck for short strings

      Your L1 and L2 cache-lines are 64-byte minimum read / write anyway (actually, some systems were 128-byte minimum IIRC). A 16-byte read is literally smaller than what your cache does on the path into your registers.

      EDIT: More importantly, modern CPUs only have 2x or 3x load/store units. Meaning a modern CPU can only perform ~2ish load/stores per clock tick. The SSE (16-byte / 128-bit) read / write to L1 cache will perform the same speed as an 8-byte / 64-bit read/write to L1 cache in practice.

      • fanf2 2 years ago

        In fact the code that I left out of the blog post does almost exactly what you suggest :-) It’s the “load 5 bytes” and “store 5 bytes” that I don’t have a good solution for. At the moment I am using memmove() and relying on the compiler developers to have better ideas about optimizing it than I do… The bottleneck comes from the number of instructions and the branchiness, not the data bandwidth.
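
        Concretely it's roughly this shape, i.e. bounce through a uint64_t and let the compiler do what it likes with the small memmove()s (a sketch, not the actual code from the post):

            #include <stdint.h>
            #include <string.h>

            uint64_t tolower8(uint64_t);   /* the SWAR transform */

            /* lowercase a label of len <= 8 bytes in place */
            static void tolower_label(unsigned char *label, size_t len)
            {
                uint64_t w = 0;
                memmove(&w, label, len);   /* the "load N bytes" */
                w = tolower8(w);
                memmove(label, &w, len);   /* the "store N bytes" */
            }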

        I briefly considered playing games with overlong reads, but then asan gave me a slap and I reconsidered the path to wisdom.

        • dragontamer 2 years ago

          > branchiness

          Memorize the following concept / pattern.

              int select(int choice, int A, int B){
                  return (A & (~choice)) | (B & (choice)); 
              }
          
          You can convert any "if(bool)" statement into the above 4 assembly instructions: two ands, an or, and a not.

          This performs a 32-bit way parallel "choice" operation. For example, "choice = 0x00000000" will choose all bits from A. "choice = 0xFFFFFFFF" will choose all bits from B.

          It's SIMD because "choice = 0xFFFF0000" will choose the upper 16 bits from B and the lower 16 bits from A.

          ------

          Any data-operation can be "simulated" with enough code operations, and vice versa. Obviously, use code when code is needed, and data when data is needed. But sometimes, you need to perform a "code/data" switcheroo and the above pattern helps.

          EDIT: It doesn't always work. The combinatorial explosion of data may prevent such a technique from working in the general case. But still, it's a good first step, and "most if statements" are simple enough to work with the above pattern.
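
          For example, "if (c >= 'A' && c <= 'Z') c |= 0x20;" becomes something like this (sketch; c holds one byte, and select() is the function above):

              /* branchless ASCII tolower for one byte */
              int tolower1(int c)
              {
                  unsigned flag = (c >= 'A') & (c <= 'Z');   /* 0 or 1 */
                  int mask = (int)(0u - flag);               /* 0x00000000 or 0xFFFFFFFF */
                  return select(mask, c, c | 0x20);          /* A when mask==0, B when mask==~0 */
              }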

          • nwallin 2 years ago

            SSE4.1 added `blend` which does this for most data types. The lone exception is 32 bit integers for whatever reason, which wasn't added until AVX2.

        • dragontamer 2 years ago

          > It’s the “load 5 bytes” and “store 5 bytes” that I don’t have a good solution for.

          You build that out of "load 16 bytes" and "store 16 bytes" operations.

              register = [news.YCOMBINATOR]
              isValid  = [00000FFFFFFFFFFF]; // "F" is really 0xFF, but you get the gist
              // isValid was generated because you were focusing on the YCOMBINATOR part
              // of the DNS entry: news . YCOMBINATOR . com. YCOMBINATOR there to focus
              // upon the current data being considered, while "news." and "com" should
              // remain untouched
          
              register = (register | ((register >= 'A' & register <='Z')  << 5));
              register = register & isValid;
          
              toWrite = writeLocation & (~isValid);
              toWrite = toWrite | register;
              writeLocation = toWrite;
          
          If you're doing things in-place, then writeLocation could be a copy of "register" from the 1st step.

          • dzaima 2 years ago

            Doesn't work if you don't have read access past (or before) the segment you want. And, as GP says, asan won't like reading/writing out-of-bounds (even if temporarily) either. Probably very unlikely to happen, but that's not really something you want to be betting on.

            • dragontamer 2 years ago

              I get it now. Thanks for explaining. That particular problem is solved by rounding up to the SIMD-width for all SIMD-allocations.

              Ex: 13-size gets rounded up to 16-sized alloc. 25-size gets rounded up to a 32-sized alloc.

              This, in combination with "processing of garbage" and "Valid" flags, should cover all the cases.

              EDIT: At least, for the "dynamic" strings that are loaded into memory. Someone elsewhere has brought up compiler strings, which seem complicated to me (I'm not sure if we have any alignment guarantees, and therefore are forced to process strings sequentially, one byte at a time if they're from the compiler).

              • janwas 2 years ago

                I agree padding is a good solution when you can arrange it.

                In Highway we also use the blending/bit-select you mention: https://github.com/google/highway/blob/master/hwy/ops/generi... If HWY_MEM_OPS_MIGHT_FAULT, which is true if using ASAN or AVX2 (because AMD reserves the right to fault even if mask=0), we actually use a scalar loop.

                I'm curious if anyone has data indicating that a jump table is faster, enough to make a difference?

                • dragontamer 2 years ago

                  > I'm curious if anyone has data indicating that a jump table is faster, enough to make a difference?

                  Anything that touches the branch predictors on these modern, crazy CPUs, is going to be too complicated to really understand from a performance perspective!

                  I'm pretty sure that jump tables / if statements all touch the branch predictor. Things will be high performance if the branch is consistently predicted correctly, but good luck figuring that out without just running the darn program on a CPU.

                  • dzaima 2 years ago

                    Between a jump table and a scalar loop, the loop will screw with the branch predictor more.

                    A jump table has the potential to be faster by doing SIMD, at the cost of taking more icache.

        • Const-me 2 years ago

          > It’s the “load 5 bytes” and “store 5 bytes” that I don’t have a good solution for

          Strings in C are null terminated. You can avoid loading exactly 5 bytes. For a string of length 4 load 4 bytes; for a string of length 5 load 6 bytes: you'll read 5 characters of the payload + the terminating '\0'.

          If you code in C++, the standard library was fortunately designed with C interop in mind: strings from the standard library have a c_str() method.

          About how to do that efficiently, a good way is a jump table. Not all compilers are smart enough to compile switch into a table, but many of them are. Modern GCC usually does the right thing, take a look: https://godbolt.org/z/sK7jcoooa (untested)

          Partial stores are pretty similar. Compare the vector of bytes against `\0`, use _mm_movemask_epi8(), then _tzcnt_u32() or _BitScanForward() to find the length, then in the switch use various extract instructions: _mm_cvtsi128_si32, _mm_cvtsi128_si64, _mm_extract_epi16, and/or _mm_extract_epi32.
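
          The length-finding step is only a few instructions, something like this (untested sketch, assumes SSE2 + BMI1):

              #include <immintrin.h>

              /* index of the first '\0' byte in v, capped at 16 */
              static unsigned vec_strlen16(__m128i v)
              {
                  __m128i eq0 = _mm_cmpeq_epi8(v, _mm_setzero_si128());
                  unsigned m  = (unsigned)_mm_movemask_epi8(eq0) | 0x10000u;
                  return _tzcnt_u32(m);
              }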

    • adrian_b 2 years ago

      That is possible in better SIMD instruction sets, like Intel/AMD AVX-512 or Armv9 SVE2.

      In such instruction sets not only load and store but also most of the other instructions can operate on a subset of bytes/words/dwords/qwords of the registers, by using bit masks to specify which parts of a vector register are used in an operation.

      It is possible to specify whether the unused parts should retain their previous value or they should be set to zero.

      Moreover, for load and store, it is possible to load or store some number of bytes or of larger data from non-sequential locations, using the so-called gather and scatter instructions.
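
      With AVX-512BW, for example, loading exactly n bytes (n <= 64) is a single masked instruction (sketch, assuming a target compiled with AVX-512BW):

          #include <immintrin.h>
          #include <stddef.h>

          /* load exactly n bytes from src; the remaining lanes read as zero */
          static __m512i loadn(const void *src, size_t n)
          {
              __mmask64 k = (n >= 64) ? ~(__mmask64)0
                                      : (((__mmask64)1 << n) - 1);
              return _mm512_maskz_loadu_epi8(k, src);
          }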

    • dzaima 2 years ago

      Doesn't seem to be for SSE & AVX at least (only 32-bit and 64-bit groups in AVX, and nothing for SSE). AVX-512 has _mm512_maskz_loadu_epi8, but not much has AVX-512.

      It's a shame that there aren't guarantees on being able to overread some number of garbage bytes. (For a SIMD-heavy project, I've just gone with a custom allocator that guarantees the ability to read past the end of allocations, and storing by either maskstore for 32/64-bit, or blending with what's already there for 8/16-bit.)

  • cratermoon 2 years ago

    > on today's systems

    BIND still supports critical internet infrastructure on less capable hardware.

  • fulafel 2 years ago

    There's some uncovered ground between the processors having SIMD support and being able to use it portably from C. Also a compiler could compile this code to use SIMD instructions.

  • moonchild 2 years ago

    > modern SIMD instructions execute at one-per-clock-tick

    Or even two, or four on Apple ARM, plus load and store which can be done in parallel.

    • dragontamer 2 years ago

      Yeah, good point. SIMD is also superscalar on most systems.

      For x86, Intel can do 3x AVX512 instructions per clock tick as long as each instruction is simple enough (add, AND, OR, NOT, XOR, maybe even multiply if you're not counting the latency issue)

      • moonchild 2 years ago

        Only port 5 and port 0/1 can do avx512 instructions on skylake/icelake, so I don't think you can get better throughput than 0.5 on current parts. Unless you count load & store as well, like I mentioned.

        • moonchild 2 years ago

          (Unless you count an avx512-specific instruction operating on a 32- or 16-byte vector as an 'avx512 instruction'.)

        • mhh__ 2 years ago

          Skylake is 7 years old

          • moonchild 2 years ago

            And icelake is 3 years old. If you have a newer avx512-supported chip which can do more than two 512-bit alu ops per cycle, I would love to take a look at it.

            • mhh__ 2 years ago

              throughput and latency did improve with alder lake (...), but i see your point.

  • kristianp 2 years ago

    Google and Wikipedia tell me that SWAR is "SIMD within a register", also known by the name "packed SIMD".
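
    For this article's tolower that means something along these lines, eight bytes per step in an ordinary 64-bit register (my own reconstruction for illustration, untested, not the author's exact code):

        #include <stdint.h>

        static uint64_t tolower8(uint64_t octets)
        {
            uint64_t ones     = 0x0101010101010101ull;
            uint64_t all80    = 0x8080808080808080ull;
            uint64_t heptets  = octets & 0x7F7F7F7F7F7F7F7Full;   /* keep the adds from carrying across bytes */
            uint64_t is_ge_A  = heptets + (0x80 - 'A') * ones;    /* bit 7 set where byte >= 'A' */
            uint64_t is_gt_Z  = heptets + (0x7F - 'Z') * ones;    /* bit 7 set where byte >  'Z' */
            uint64_t is_ascii = ~octets & all80;                  /* original high bit was clear */
            uint64_t is_upper = is_ge_A & ~is_gt_Z & is_ascii;    /* bit 7 marks 'A'..'Z' bytes */
            return octets | (is_upper >> 2);                      /* 0x80 >> 2 == 0x20 */
        }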

  • Marat_Dukhan 2 years ago

    Linux-capable RISC-V cores often have 64-bit architecture and no SIMD/vector processing capabilities.

zX41ZdbW 2 years ago

ClickHouse has SIMD implementation for lower/upper for ASCII and UTF-8.

It can easily make ~20 GB/sec with 48 cores: https://pastila.nl/?0f15f812/179361234a6b79d27e18c8db7050bd0...

Although it is only using SSE2. And the UTF-8 variant is using a shortcut with SSE2, another shortcut for a subset of code points and a slow path for a generic case.

mananaysiempre 2 years ago

This technique of SWAR (“SIMD within a register”) with very narrow elements and no hardware support is explained very well in an SO answer[1] (and the linked slides) about the mythical (and useless) “fusion trees”.

[1] https://stackoverflow.com/a/56604876

conradludgate 2 years ago

A few weeks ago I optimised[0] the Rust lower/upper case conversion methods to use more SIMD features. In the end, we took a very conservative level of unrolling since we deemed it unlikely that large inputs would need case conversions.

[0]: https://github.com/rust-lang/rust/pull/97046

  • Aissen 2 years ago

    I see you used criterion here on your own repo, but it really shows that Rust is missing built-in benchmark testing (I know there's one in unstable); having the benchmarks you wrote alongside the code would prevent future hot-path regressions.

dzaima 2 years ago

Here's an AVX-2 implementation that assumes it can read up to 31 bytes past the end of the input: https://godbolt.org/z/r33h75j5h (yes, that's autogenerated C)

Requires -fno-unroll-loops as otherwise clang gets overly unroll-y; the code is fast enough. Tail is dealt with by blending the originally read value with the new one.

babelfish 2 years ago

Great article. Have been brushing up on bit manipulation for interview prep lately and I love finding easy to digest, real-world use cases like this.

mrlonglong 2 years ago

If UTF-8 is used, this will not work.

  • arthur2e5 2 years ago

    No UTF-8 in DNS. Well, not any more.

londons_explore 2 years ago

All those arithmetic calculations depend on the previous...

You should get much more throughput if you can interleave them with other instructions...

I wonder if that's what the benchmarks did?

  • dzaima 2 years ago

    An out-of-order CPU will interleave things for you. Any modern CPU should have no issues running many iterations of the loop in parallel, as nothing should depend on the previous iteration.

    But, given that the task at hand usually deals with very short inputs, there's not much to interleave with anyway.

  • fanf2 2 years ago

    Hmm, yes, there are only 2 or 3 instructions that can execute concurrently - but your observation made me realise I can eke out another by masking is_ascii more thoroughly. Thanks!

    Bulk throughput isn’t really my aim, it’s just a convenient way to get numbers big enough to be easily measurable :-)

MrYellowP 2 years ago

Had I known anyone cares about this ...

Anything else anyone cares about, that could use a speedup... ?

  • moonchild 2 years ago

    Produce a fast, in-place, ideally parallel implementation of the generalised transpose, which arbitrarily rearranges the axes of a multidimensional array (potentially duplicating some of them).

    • MrYellowP 2 years ago

      Wow! Okay, but does it have to be a general transpose? And if so, why?

      I have no idea about this (which is PERFECT!) so I've started looking:

      > In linear algebra, the transpose of a matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix A by producing another matrix, often denoted by A^T (among other notations).

      I ask, because I need to know if it matters how the solution works, or if all that matters is that the solution works as it should.

      Thank you!

      Edit: Do you have a compiled benchmark I can run, for comparison with other solutions?

      • moonchild 2 years ago

        > does it have to be a general transpose? And if so, why?

        There are two answers to this, one less interesting, and one more interesting.

        The less interesting answer is that it's a primitive in APL, and I have a vested interest in making APL fast. Then, it needs to be a general transpose because the specific case of a 2-d transpose is already well-studied and has high-quality solutions already.

        As a language feature, the transpose can be thought of as a calculus on indices. This suggests a fairly straightforward implementation strategy where no actual data is moved around; instead, a 'strided representation' is used, storing additional information about the array layout, and the indexing procedure is made aware of this. This seems to be effectively free. So why am I asking you to move data around?

        This leads into the second interpretation of the transpose, which is as a permutation. Interestingly, this is almost opposite the index-calculus interpretation, because it primarily acts to change the way the data are represented; it doesn't actually need to change how the data are interpreted at all. And this is useful because, when taking into account access patterns, laying the same data out differently can significantly improve locality and therefore performance. If you're familiar with column/row stores, or 'soa'/'aos', these are applications of the 2-d transpose. The general transpose, then, performs the obvious analogue function for higher-dimensional data and is interesting for the same reasons.

        (The duplicating-axes part is just a novelty, then.)

        As a starting point, I suggest looking at libfft; as I recall, it has some good 2-d transpose implementations, and links to relevant papers. (Fft is, not-so-coincidentally, an application that can benefit from transposing.) Also interesting may be this paper, and any relevant links that result therefrom: https://www.researchgate.net/publication/273912700_In-Place_...

        > I need to know if it matters how the solution works, or if all that matters is that the solution works as it should

        The interface is: I give you a buffer, an element size (probably 1/2/4/8 bytes), a shape projecting a multidimensional interpretation onto that buffer, and a permutation of that shape; you move around the elements of the buffer according to the permutation. How you implement that is up to you (though I will also say that I am also more curious about the general approach you come up with than with the specifics of the implementation).

        > Do you have a compiled benchmark I can run, for comparison with other solutions?

        The only place I know of to find implementations is inside of apls; the fast apl implementations are dyalog apl[0] and j[1]. I don't know how optimised their transposes are, though; they might or might not be a good baseline. Here's a simple example to get you started benchmarking inside j:

             a=. 27 1000 40 77?@$0  NB.declare a to be an array of shape 27 1000 40 77
             NB. contents are randomised; this is a memory-heavy algorithm, and apparently writing lots of zeroes can have weird cache effects
             NB. elements are double-precision floats; so, 8 bytes
             timex'2 1 0 3|:a' NB. see how long it takes to apply the permutatio 2 1 0 3 to the shape of a
          0.34551
             10 timex'2 1 0 3|:a' NB. do 10 runs, and take their average
          0.301104
        
        Dyalog is similar, but the permutation is specified in a weird way and I don't feel like looking it up atm; poke me about it later if you care to, as they do have better algorithms in some cases.

        0. https://www.dyalog.com/

        1. https://www.jsoftware.com/

        • MrYellowP 2 years ago

          I'm sorry, but this is a misunderstanding.

          You give me a problem, tell me the desired solution and I figure it out on my own.

          What I can provide you with is an optimized, compiled block of assembly code, with source, which solves your problem fast.

          What I require is a problem (you've stated it) and what the solution is supposed to look like / how the desired output is supposed to be presented / etc.

          I feel like you didn't actually do that.

          I guess the software you've linked is supposed to help me; I'll have a look at that. Usually, though, the problem itself is enough. Looking at other people's work messes with creativity.

          Anyhow, can you execute a block of compiled assembly code? Then all I need is access to the data, how it's laid out and what it is you actually want the code to do. In return you'll get a wacky, working solution you can just plug in.

          If you can't call compiled binary directly, there's still a way around that using shared memory and having my code run in a separate process.

          Anyway, I've thought about this. You definitely want the data to be moved? You don't want just more efficient access to it? That'd be trivial to do. If you don't want that, because of cache reasons, then there's a trivial way of not destroying the cache, as long as you know how to access registers directly. Or I can still drop it in-place, I guess.

          I'm sure we'll get there. I guess next time I need to lay out what's actually "in store".

          • moonchild 2 years ago

            I thought I was pretty clear

            > The interface is: I give you a buffer, an element size (probably 1/2/4/8 bytes), a shape projecting a multidimensional interpretation onto that buffer, and a permutation of that shape; you move around the elements of the buffer according to the permutation

            but if not, please tell me!

            The remainder is just providing context for the problem, but you don't have to look at it if you don't care to.

            > there's a trivial way of not destroying the cache, as long as you know how to access registers directly

            What do you mean by this? NT (non-temporal) accesses or similar? Those can be helpful in some specific cases, but are not really a general solution. Esp. if there is spatiotemporal correlation between accesses (as is often the case), and you want to be able to take advantage of the caches.

            • MrYellowP 2 years ago

              You're right. You were.

              I wasn't. I was too intimidated, because I'm not a "professional" working in the industry, so I have little to no knowledge about what others are doing, and often don't know the terms that others are using. Like how I had no idea what "general transpose" means ... but now I know.

              > The interface is: I give you a buffer, an element size (probably 1/2/4/8 bytes), a shape projecting a multidimensional interpretation onto that buffer, and a permutation of that shape; you move around the elements of the buffer according to the permutation

              I can work with that! Do you have an "in" and "out" example as a reference, or should I just make my own?

              What CPU are we talking about?

              What does the shape look like?

              Do you have example data I can start working with?

              I've never worked with other people!

              Thanks!

              PS: Does it really have to be in-place replacement?

              • moonchild 2 years ago

                No worries.

                > I had no idea what "general transpose" means ... but now I know

                FWIW pretty much no one else in industry knows what a transpose is, let alone a general one. It's rather obscure :)

                > What CPU are we talking about?

                amd64 with avx2 is probably the most important target.

                > What does the shape look like?

                > example

                The shape is a list of natural numbers. Its length is the number of dimensions in the array, and each element is the length of the corresponding axis. For instance, suppose the contents of the array are the numbers 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23. Then, if the shape is 2 3 4, that is a 2x3x4 brick. The length of the array is 24 (being the product 2*3*4), and the structure may be elucidated by the following tabular display:

                   0  1  2  3
                   4  5  6  7
                   8  9 10 11
                  
                  12 13 14 15
                  16 17 18 19
                  20 21 22 23
                
                Applying the permutation 1 2 0 to this shape yields an array of shape 3 4 2, and requires the following result:

                   0 12
                   1 13
                   2 14
                   3 15
                  
                   4 16
                   5 17
                   6 18
                   7 19
                  
                   8 20
                   9 21
                  10 22
                  11 23
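
                To pin the semantics down, here is a naive out-of-place reference in C (untested sketch, capped at 16 dimensions and ignoring the repeated-axes variant; obviously not the fast in-place thing being asked for):

                    #include <stddef.h>
                    #include <string.h>

                    /* dst gets src with its axes permuted: output axis m walks input axis perm[m] */
                    void transpose_general(void *dst, const void *src, size_t elem,
                                           const size_t *shape, const size_t *perm, size_t ndim)
                    {
                        size_t stride[16], idx[16] = {0}, total = 1;
                        for (size_t i = ndim; i-- > 0; ) {        /* row-major input strides */
                            stride[i] = total;
                            total *= shape[i];
                        }
                        for (size_t out = 0; out < total; out++) {
                            size_t in = 0;                        /* map output index -> input offset */
                            for (size_t m = 0; m < ndim; m++)
                                in += idx[m] * stride[perm[m]];
                            memcpy((char *)dst + out * elem,
                                   (const char *)src + in * elem, elem);
                            for (size_t m = ndim; m-- > 0; ) {    /* odometer over the output shape */
                                if (++idx[m] < shape[perm[m]]) break;
                                idx[m] = 0;
                            }
                        }
                    }

                With shape {2,3,4} and permutation {1,2,0} this reproduces the 3 4 2 result above.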
                
                > Does it really have to be in-place replacement?

                Well, that's the trick, isn't it? :P

                • MrYellowP 2 years ago

                  AMD! Convenient, because I'm running a 5900HS right now. :D

                  Wait ... you want to be able to pick the permutation? I thought it's about a fixed transposition, but you want to choose? lol

                  What a complicated mess! 234 ... two blocks of data, containing three lines of data, containing four entries of data.

                  I'm not sure the example is sufficient for me to understand. I hope I can figure this out for other cases than 1 at the start, which seems to be a bad example for understanding this. if 1 was 2, do I instead, then, pick every second number?

                  I'm trying to wrap my head around this, starting with wikipedia.

                  Why don't you use a better data structure? Seems rather wasteful, CPU-wise, not coming up with some more generic structure that makes it easier to permute?

                  But ... I guess that's not allowed, otherwise you'd not need in-place replacement.

                  On the other hand, as long as I store everything in-place it shouldn't matter, as long as the result is correct.

                  But are you sure you need in-place replacement as long as you can get the correct, permuted result? Do you request the same results more than once?

                  Yes, I keep asking again looking for a way to avoid it. I have to make sure I cover everything and know the desired outcome exactly. Like, if you just care about the result itself, then that's different to having to store it in memory.

                  Hm. Given that there's some sort of memory limitation ... How much space do I have, in bytes/kbytes, for code? Can I allocate my own memory?

                  Questions, questions, questions. Details, details, details.

        • mlochbaum 2 years ago

          Figured I may as well start writing a notes page on Transpose instead of just replying here:

          https://mlochbaum.github.io/BQN/implementation/primitive/tra...

          (EDIT: missed a 0 in the first version and got the wrong timing for Dyalog)

          For the particular case you give, the axis permutation swaps two axes, so it's invertible and the same left argument applies in both J and APL (may require adjustment for ⎕IO of course). I measure Dyalog the same as J, 0.3s. This array has a large fixed axis at the end, meaning that chunks of 77 floats (over half a kB) stay together, and the best way is just to move each of these with memcpy. Seems both languages miss this optimization? 0.3s is only 2.2GB/s. The array's definitely too large to fit in even L3 cache but this seems too slow for an in-memory copy.

          • moonchild 2 years ago

            Thanks.

            > cache associativity

            This one is annoying. Nt accesses might help, though probably not for the in-place case. You mention padding; the paper I linked uses padding as well, to shorten cycle length for inplace+parallel. Although, I would guess that dimensions which are nice to the cache will be not-nice to the blocking, and vice versa

            > Of course larger loops are possible as well. But most of the time it's really the base case that's important. The base case also handles two axes most of the time, but can incorporate all the 2D optimizations like blocking.

            Unless I am misunderstanding--reversing the axes of an array of shape 1e6 1e3 2 2 will have an inner loop dealing with the two innermost axes, and some extra dispatch outside of that. Which is not ideal, obviously--the inner loop only deals with 4 elements.

          • mlochbaum 2 years ago

            No, I think they're both using memcpy. perf says they're spending all the time in libc at least. The total time is a little slower than operations like rotate that I know are using memcpy (0.18s), so the chunk size does seem to introduce some overhead.

            • moonchild 2 years ago

              Glibc string functions are ok, but not great (though one-size-fits-all is hard, and for all I know they could be fine here).

              While 512 bytes seems like a lot, the core loop is likely 4x unrolled avx; that is 128 bytes per iteration, so only 4 iterations. And it probably tries to align the dst, so annoying dispatch overhead (which you could avoid by working in batches of k at a time, eg k=4 for double floats and avx in the general case).

              • mlochbaum 2 years ago

                The big difference is the access pattern: see the benchmarks below. Index does the small memcpys, but it speeds up if the indices are in order (my earlier benchmark used ⌽, not ⊖, because I don't remember APL any more). So prefetching might help. I guess it's possible that a larger-scale blocking would too?

                But 5GB/s for ⊖ isn't great either. In an application that uses these huge arrays and needs the best performance (most don't!), it needs to be split up, ideally into sections that fit in L1, so that multiple array operations can be applied to those chunks and stay CPU-bound (at least for transpose, maybe not for arithmetic). That's why I wouldn't be too interested in optimizing this case.

                        ⎕IO←0 ⋄ ar←,[0 1 2]a←?27 1000 40 77⍴0
                        ci←(⊢⍴∘⍳×/)3↑⍴a  ⍝ cell indices
                
                        i←⍉ci ⋄ cmpx '2 1 0 3⍉a' 'ar[i;]'
                    2 1 0 3⍉a → 2.8E¯1 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
                    ar[i;]    → 2.8E¯1 |  0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
                
                        i←5⊖ci ⋄ cmpx '5⊖a' 'ar[i;]'
                    5⊖a    → 1.2E¯1 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
                    ar[i;] → 1.7E¯1 | +46% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
        • MrYellowP 2 years ago

          Well, I guess you won't be responding anymore. Sorry, I can't wrap my head around the problem properly and it takes way too much time figuring it out, which shouldn't be necessary.

          The example given is insufficient. I've tried figuring out how this whole thing works, but I didn't find any actually good explanation that's not written for mathematicians.

          I've read through the other comments and it even looks like people are approaching this problem mathematically, which makes no sense to me.

          There's someone mentioning that his code apparently uses memcpy and I'm wondering if he actually really knows what he's doing, because that's just massive overhead for elements which - as I believe - fit into registers.

          Compilers don't write the best code out there, period. After several generations of this attitude, how would people even know, if all they ever do is have blind faith that the compiler will do their job?

          Anyhow, I digress into ranting ...

          I have several solutions in mind, but apparently I require actual data and at least a few examples that aren't so simple that they don't actually reflect the desired outcome.

          I'm sorry about that. It's just how I work. I need something to work with and the scope of this is way beyond reasonability, because all the learning involved does not appear to be at all required for solving the problem.

          So, instead of doing it myself (I'd still love to), I'll just tell you the solution I came up with.

          To me, this problem seems solvable just by writing code dynamically. Based on how you want the result to be laid out, you just drop blocks of assembly code meant to perform the tasks required to solve the problem.

          Tasks, like, "read the element from memory" and "write it somewhere else".

          There's room for exploration, by trying different methods of reading and writing the elements. There's more ways than mov to do the job.

          How all of this would work:

          You know your memory access patterns, both for reading and for writing.

          You can sort the access patterns based on linear locality. If it boils down to one "cacheline per read" and you're not allowed to change the structure of the data to make things easier on the cache, then that's just how it is and there's no way around it.

          With your sorted access patterns, you start writing the blocks of adjusted (memory addresses) compiled code into your executable memory.

          Given that there are different ways of doing the reads/writes, you could do what I'd do and have several different blocks of code to experiment with. Like, one can abuse push/pop for memory transfers, eight bytes per.

          If you want to be really fancy you'd make a benchmark shuffling around the access patterns (the blocks of code and their respective variations) until you find the quickest, assuming there's any practical value for you in doing so.

          I know I would, because this problem can be brute-forced.

          Now you execute your created block of code, which then rewrites it all in place, hopefully making proper use of pipelining.

          And if you want to go really, really fancy ... you do it multithreaded.

          I can see that working ... and I can see room for tinkering. Further exploration follows after a working prototype.

          Well ... so far that's all I have. From my perspective this problem is solvable, because it's just memory accesses and, including exploration of possible optimizations, clever use of registers.

          Looking at it from a mathematical perspective and trusting the compiler to figure out how to do it quickly doesn't at all appear to me like it's going to cut it.

          Thank you for coming to my TED talk.

charcircuit 2 years ago

Since domains are UTF-8, doesn't the assumption of ASCII break down?

  • sveiss 2 years ago

    Domain names are limited to a small subset of ASCII; the DNS protocol doesn’t support UTF-8.

    To work around this, the international domain name standard defines an encoding called Punycode which maps Unicode to the limited character set DNS supports (for example, a label like "bücher" goes on the wire as "xn--bcher-kva"). The server is unaware of this, and so this optimised tolower() implementation works without any Unicode considerations.

    • asveikau 2 years ago

      Lower case is locale-sensitive, however. For example, tolower('I') in Turkish should be 'ı'.

      And within Unicode, doing it in a "dumb ASCII" way probably needs some normalization of diacritics. E.g. 'é' should be U+0065 U+0301 ("e\xcc\x81"), not U+00E9 ("\xc3\xa9").

      Not sure how punycode handles this, I did once look deeply into it but that was years ago.

    • charcircuit 2 years ago

      Can you link me to where this is specified? From what I can tell there is no requirement to use ASCII in domain names.

      • sveiss 2 years ago

        Sections 2.2 and 2.3 of RFC 5890 [1] have a good overview of the pre-IDN state of things, with reference to where these were originally specified. In particular, the definition of 'LDH label' describes a single component of a traditional DNS name:

           This is the classical label form used, albeit with some additional
           restrictions, in hostnames [RFC0952].  Its syntax is identical to
           that described as the "preferred name syntax" in Section 3.5 of RFC
           1034 [RFC1034] as modified by RFC 1123 [RFC1123].  Briefly, it is a
           string consisting of ASCII letters, digits, and the hyphen with the
           further restriction that the hyphen cannot appear at the beginning or
           end of the string.  Like all DNS labels, its total length must not
           exceed 63 octets.
        
        I misspoke slightly in my earlier comment: it turns out the DNS protocol does allow octets of any value in a label, but the Internet domain name system -- including all the existing clients, servers, registrars, etc -- as a whole does not. Which is part of why we need a fairly complicated set of specifications to make non-ASCII domain names work.

        [1] https://datatracker.ietf.org/doc/html/rfc5890#section-2.2

  • mgcunha 2 years ago

    DNS labels are limited to a subset of ASCII characters. Unicode support in DNS is available via Punycode, which is an encoding from Unicode onto that restricted ASCII set acceptable for DNS. Libraries such as the ones discussed here typically perform Unicode-to-Punycode conversions before doing any label comparisons to ensure accuracy. In particular, this tolower implementation would likely be used against the Punycode-encoded version of the Unicode domain.

    https://en.wikipedia.org/wiki/Punycode

  • jimmygrapes 2 years ago

    Absolutely. This article is reminiscent of the type of brief tutorials from the mid/late 90s in QBASIC newsletters introducing how to use ASM to get speed gains, with all the associated assumptions. The concept of "lowercase" breaks down terribly once you go beyond simple A-Z. You can apply similar rules to certain limited code sets, but a universal tolower() won't work so well with this method.

    • jandrese 2 years ago

      "Universal tolower()" doesn't even make sense. What would you expect from tolower("Σ"), "σ" or "ς"? If you assume tolower(toupper(mychar)) == tolower(mychar) then your code is going to break.

rep_movsd 2 years ago

Would it not be simpler to bitwise OR bit 5 of every byte to one, based on a byte mask for uppercase chars?

Also, use SIMD - I think even AVX can do byte-level parallelism for 256 and 512 bits at a time?

londons_explore 2 years ago

Worth patching libc?

  • Findecanor 2 years ago

    This type of bit manipulation will only work with pure ASCII, which was designed to make this transformation simple with only bit manipulation — just not in parallel.

    Most systems default to using Unicode these days, for which the problem is much more complex even when the language is set to English.

    • xxpor 2 years ago

      I was really disappointed when I found out the author didn't have to deal with the dotless I problem.

  • mananaysiempre 2 years ago

    The standard tolower() / toupper() take a single character only; bulk strlwr() / strupr() are nonstandard (if commonly present) and—more importantly—virtually unused. I suppose implementing this technique in the C/POSIX-locale version of nonstandard stricmp() / POSIX strcasecmp() might be helpful in some cases, because people do use that one, but I still expect that any situations that truly call for an ASCII-only case-insensitive comparison (parsers?) will be doing much more work per byte for other reasons (state machine, perfect hash, etc.).

  • stabbles 2 years ago

    It's not the same: libc's tolower() takes a single character as an int.

    • Findecanor 2 years ago

      Although ... the existence of a libc with SIMD versions of its functions is not implausible. There are compilers that would produce such functions beside the normal ones if the source is decorated with the right #pragma.

      Such functions would be called only from within vectorised loops (or other SIMD versions of functions).

      • moonchild 2 years ago

        Indeed. For instance, glibc's math library has vectorised trig functions, and gcc can generate calls to them if it sees fit. For the rest of us, there is sleef.

  • vinkelhake 2 years ago

    libc tolower() works on a single char at a time. This post is about the gains you can get by converting multiple chars at once.

    • londons_explore 2 years ago

      Presumably with the right __inline stuff it could be vectorized by a compiler, since tolower is normally called from a loop?

      Although I guess since it's often a dynamically linked library, said optimizations would have to happen at load time, and the dynamic loader doesn't do such things... :-(