Well, then you may also appreciate changing my Nim program, replacing everything from the `continue` down to the final output `for`, with:
    let mf = mopen(path)
    if mf == nil:
      stderr.write "wc: \"",path,"\" missing/irregular\n"
    else:
      var word = ""
      for c in toOpenArray(cast[ptr UncheckedArray[char]](mf.mem), 0, mf.len-1):
        let c = c.toLowerASCII
        if c in {'a'..'z'}:
          word.add c
        else:
          count.maybeCount word
          if word.len > 0: word.setLen 0
      count.maybeCount word # file ends in a word
      mf.close
You also need to add `, cligen/mfile` to the import line and install the current development head of the cligen package.
With that the Nim time goes down to 6.1 ms (no PGO) and 5.0 ms (PGO). Just a quick after-dinner fix-up.
{ There is actually an ASCII trick here: add `'A'..'Z'` to the character set and then `or` in a bit to make the letter lowercase (since 'a' - 'A' == 32). I just did the above to test that cligen/mfile worked with FIFOs in the mix, not to maximize performance. I suspect pre-sizing the table, e.g. by adding 10_000 or 20_000 to the init, would make more of a difference. }
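A minimal sketch of that ASCII trick, pulled out of the word-count context (the proc names here are just for illustration): test both cases in one set membership check, then `or` in bit 5 (0x20) to force lowercase, since ASCII upper and lower case letters differ only in that bit.

```nim
proc asciiLower(c: char): char = char(ord(c) or 0x20)

proc words(s: string): seq[string] =
  ## Split on non-letters, lowercasing branch-free via the bit trick.
  var word = ""
  for c in s:
    if c in {'a'..'z', 'A'..'Z'}:
      word.add c.asciiLower       # no branch on case; just set the bit
    else:
      if word.len > 0:
        result.add word
        word.setLen 0
  if word.len > 0: result.add word  # input ends in a word

echo words("It was the BEST of times")
```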
Beyond the above couple of ideas, the next step to make it faster would be to get rid of almost all dynamic memory allocation by keeping a table of (byte-offset, len, count) triples (or quadruples with a cached hash code) and rolling your own "memcasecmp" if your stdlib does not provide one. This could make the histograms small enough to be fully L2-resident: roughly 8..12 bytes times vocabSize, i.e. 192 KiB for a vocabSize of 16384. https://github.com/c-blake/adix/blob/master/tests/anaPrime.n... has an example of how to do this in Nim (for values in a table, though in our case here it would be for keys and so need `hash` and `==`).
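A hedged sketch of that allocation-free layout (not the adix code linked above; `Histo`, `Slice3`, and `put` are names invented here): keys are (offset, len) slices into one big text buffer, so counting a word never allocates a string. The sketch assumes an already-lowercased buffer, a power-of-two table size, and a table that never fills.

```nim
import std/hashes

type
  Slice3 = object        # byte-offset, length, count triple
    off, len, cnt: int32
  Histo = object
    buf: string          # the whole (e.g. memory-mapped) input
    tab: seq[Slice3]     # open-addressed; len == 0 means empty slot

proc hashSlice(h: Histo; off, len: int): Hash =
  hash(toOpenArray(h.buf, off, off + len - 1))

proc eq(h: Histo; s: Slice3; off, len: int): bool =
  ## Plays the "memcasecmp" role; buffer assumed pre-lowercased here.
  if s.len != len: return false
  for i in 0 ..< len:
    if h.buf[s.off + i] != h.buf[off + i]: return false
  true

proc put(h: var Histo; off, len: int) =
  ## Count one occurrence of buf[off ..< off+len]; tab.len power of 2.
  var i = hashSlice(h, off, len) and (h.tab.len - 1)
  while h.tab[i].len != 0 and not h.eq(h.tab[i], off, len):
    i = (i + 1) and (h.tab.len - 1)   # linear probe
  if h.tab[i].len == 0:
    h.tab[i] = Slice3(off: off.int32, len: len.int32, cnt: 1)
  else:
    inc h.tab[i].cnt
```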
At that point, if you had a lot of data, the better way to parallelize is also neither of the two ways DGA mentions in his original blog post with one shared table, but rather one histogram per CPU with no locks or memory contention at all, plus a final merge of the CPU-private histos (after processing all files). The merge can probably be single-threaded, though you could also organize it as a "tree" of parallel merges (e.g. first merge every pair, then every pair of the results, and so on).
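The merge step is simple enough to sketch (here single-threaded, with `std/tables.CountTable` standing in for whatever per-CPU histogram structure you actually use; `mergeHistos` is a name invented for illustration):

```nim
import std/tables

proc mergeHistos(parts: seq[CountTable[string]]): CountTable[string] =
  ## Combine per-worker histograms. Each worker filled its own table
  ## with no locks, so only this final pass touches data from others.
  for part in parts:
    for word, n in part:
      result.inc(word, n)

# Two workers counted disjoint chunks of the input:
var a, b: CountTable[string]
a.inc "the", 3; a.inc "cat", 1
b.inc "the", 2; b.inc "dog", 1
let merged = mergeHistos(@[a, b])
```

A tree of pairwise merges would just apply the same proc to pairs of partial results in parallel.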
Whether to parallelize over files or over big chunks of files would depend upon how big your files are. The big-chunks approach gives more regular work balance because making all subproblems the same size is easy, but it does need "clean-up" for any words spanning chunk boundaries, which is a bit annoying; there are only as many fix-ups as you have threads, though, which is small. It would be unsurprising to me if Tale of Two Cities could be done in something like 0.1 ms with 32 cores, up to 500x faster than that initial Rust version (maybe omitting thread start-up overheads).
This particular problem is actually so fast that you might need to use all of Dickens cat'd together, or more, to get numbers less polluted by system noise, microsecond-scale thread start-up times, and small L2 histo merges, or else take min-of-more-than-5 runs for your timings. But all the principles I've mentioned apply more broadly (except maybe the ASCII case trick).
Anyway, these are all sort of obvious points on a topic that was more about prog.langs, but a good systems programming language + ecosystem ought to make it "easy-ish" to get that last factor of 10..500x, which is easy enough to express in a HackerNews comment.