The paper (https://static.usenix.org/publications/compsystems/1993/win_... linked to from the page) is well-written and worth reading, iterating over multiple C programs that build up to this sort algorithm and giving a good rationale for each iteration.
I don't actively use it, unfortunately. The main inspiration was to faster sort inputs into the https://github.com/BurntSushi/fst crate, which I in turn used to try to build a search library.
Sorts like these work really well on GPUs. You would start with a single thread for the first pass and then keep spawning one-thread-per-bucket for every additional pass until done. It isn't particularly taxing on the GPU; your advantage comes from being able to run potentially thousands of threads simultaneously without having to worry about coordinating read/write memory access.
> The name American flag sort comes by analogy with the Dutch national flag problem[2] in the last step: efficiently partition the array into many "stripes".
Ignoring the fact that if you have Unicode encoded strings, you probably want some sort of lexical sort rather than a sort on the raw Unicode character codes which is semantically meaningless, there is no reason to assume that Unicode data would be any different to sort than any other data.
Unicode collation is very complex. If you wanted to sort according to that, then you would need to devise an isomorphism between unicode strings and bitstrings such that lexicographic ordering in the bitstring domain agrees with unicode collation in the unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like utf8 and do the obvious thing.
> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc
The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.
Seems like it should work for arbitrary byte strings (any charset, any encoding)but obviously the performance characteristics will differ because of non-uniform distribution. But that happens even in ASCII.
Lexicographic encoding of UTF-8 byte sequences matches lexicographic order of the sequence of Unicode code-points. So you can sort UTF-8 strings as byte strings. Not that sorting by code-points has much meaning, but you can use the Unicode collation algorithm first.
TIL, thanks. That makes sense given how the length prefixes look like but I never thought about it. I wonder if this was by chance or if the UTF-8 creator thought about it.
No. The very article you link details the difference, and that Counting Sort is often a subroutine in Radix Sort, and that Counting Sort is extremely poorly suited to strings, which is the entire purpose of this sort. And this name is better. So... "no" in basically every way possible.
Characters in a string can be thought of as digits in a base-256 number. Call counting sort recursively on each bucket, looking at the next character in the string. Can you not see the similarity?
There is no one who fails to see the similarity, because they're both kinds of radix sorts. For instance, a counting sort is a degenerate form of radix sort that only does one pass.
But where the classic radix sort is bottom-up, the American flag sort is top-down. That really matters for data where the the most significant portions of the data is likely to be all you need to consider for the sorting. While a top-down sort will have similar performance to bottom-up in the worst case, in the best case it can skip a lot of passes, especially if the inputs are much longer than the longest common prefix.
They're related, but in a very specific way that does not meet the bar for "variation" in my opinion. As you described, counting sort is a very simple procedure that works well only on a very specific input domain. If another algorithm has "more logic" than counting sort itself in order to transform a very different input domain into a format that is suited for making many repeated calls to (a procedure that may be implemented as) counting sort, in a specific pattern, and appropriately coalescing the results of those calls, I think it is appropriate to let it have its own name. Would you prefer to refer to every possible utilization of counting as "that version of counting sort where...."?
American flag sort is essentially an in-place version of counting sort.
That's talking about the single-pass American flag sort. If you use American flag sort to sort strings, you do multiple passes: on the first pass, you sort all strings by their first character, then for each starting character, you sort the strings with that starting character (which are now in a contiguous subarray) by their second character, and so on.
@chowells calls it a “top-down radix sort” below, which is a great description. It also explains the strengths and weaknesses of the two algorithms: radix sort works great for small strings of fixed length (e.g. IPv4 addresses, which can be thought of as 4-byte sequences) while American flag sort works great for variable-length strings like actual textual strings, especially if they don't share common prefixes (e.g. dictionary words, usernames, etc.)
@hinkley pointed out that the recursive version is just a bucket sort, but in-place. Which is also true!
tl;dr:
Single-pass American flag sort = in-place counting sort
Recursive American flag sort = in-place bucket sort
The paper (https://static.usenix.org/publications/compsystems/1993/win_... linked to from the page) is well-written and worth reading, iterating over multiple C programs that build up to this sort algorithm and giving a good rationale for each iteration.
There is currently a typo in the link which prevents it from working - the correct link is https://static.usenix.org/publications/compsystems/1993/win_...
Interesting read, thanks!
Looks like there's a Rust implementation: https://crates.io/crates/afsort
> When sorting English words, this implementation seems to be about 40% faster than sort_unstable from the Rust standard library.
That's better than I would have expected. It would be interesting to see how it does on other types of string sort problems.
Hi! Crate author here, happy to see my old project getting mentioned. Let me know if you have any questions!
Do you actively use this function in any projects? What was your inspiration to write the crate?
I don't actively use it, unfortunately. The main inspiration was to faster sort inputs into the https://github.com/BurntSushi/fst crate, which I in turn used to try to build a search library.
Sorts like these work really well on GPUs. You would start with a single thread for the first pass and then keep spawning one-thread-per-bucket for every additional pass until done. It isn't particularly taxing on the GPU; your advantage comes from being able to run potentially thousands of threads simultaneously without having to worry about coordinating read/write memory access.
At first I didn't think I'd be interested.
> Using some efficiency techniques, it is twice as fast as quicksort for large sets of strings.
I’m not calling it flag sort.
It’s an in-place bucket sort which is interesting.
From Wikipedia[1]:
> The name American flag sort comes by analogy with the Dutch national flag problem[2] in the last step: efficiently partition the array into many "stripes".
[1] https://en.m.wikipedia.org/wiki/American_flag_sort
[2] https://en.m.wikipedia.org/wiki/Dutch_national_flag_problem
I read this rationale before I made my comment.
> From Wikipedia[1]:
...or from the very article itself...
Or in-place radix sort maybe.
I agree with you, the name is ridiculous, reminds me of miracle sort or intelligent design sort: https://www.dangermouse.net/esoteric/intelligentdesignsort.h...
Probably those two flag sort algorithms were named before the joke algorithms were popularily known.
Quicksort isn't great when comparing two items is expensive.
Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
Wikipedia page https://en.wikipedia.org/wiki/American_flag_sort
The next video that came up for me was "sorting algorithms to relax/study to": https://www.youtube.com/watch?v=vr5dCRHAgb0
I'm almost embarrassed by how long I've watched it!
No offense, but I don't think that visualization helps very much.
Here's a visualization of a much larger data set that illustrates what's actually happening with the buckets and the major steps.
https://www.youtube.com/watch?v=21jEd1FUiV8
Thanks, this one makes the stripe pattern much more apparent.
Looks more like anarchist flag sort...
I’ll take your word for it, AnarchismIsCool ;)
> Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
OT: I quite liked the follow up video Youtube suggested -
Sorting algorithms to relax/study to - https://www.youtube.com/watch?v=vr5dCRHAgb0
Wondering how this sort would perform with Unicode input (given 256 buckets); specifically what the results would look like.
Ignoring the fact that if you have Unicode encoded strings, you probably want some sort of lexical sort rather than a sort on the raw Unicode character codes which is semantically meaningless, there is no reason to assume that Unicode data would be any different to sort than any other data.
Presumably it would be fine, since unicode input is still a string of bytes at the end of the day.
Unicode collation is very complex. If you wanted to sort according to that, then you would need to devise an isomorphism between unicode strings and bitstrings such that lexicographic ordering in the bitstring domain agrees with unicode collation in the unicode domain. I'm not saying it's impossible, but it's not at all obvious how to do it. On the other hand, if all you want is some total order for acceleration structures, then you can indeed just use an arbitrary encoding like utf8 and do the obvious thing.
Isn't that exactly what the Unicode collation algorithm is?
https://unicode.org/reports/tr10
tl;dr from its wikipedia page:
> The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with Unicode. These keys can then be efficiently compared byte by byte in order to collate or sort them according to the rules of the language, with options for ignoring case, accents, etc
Ah! You are right.
You can just use the icu library to do that. Its a very common thing to do when working with unicode collations.
The obvious solution to this is to pre-compile a list of all possible Unicode strings up to the required length, sort them, and create a lookup table using their indices. I wonder for what kind of project this would be useful.
If that works for you for anything but very small strings, I want to buy your computer.
Even for 3-codepoint strings it already wouldn't fit in any computer that exists. And that is not sufficient to encode all 1-glyph strings.
It says it sorts one byte at a time. I think this would break for anything not utf-8.
Seems like it should work for arbitrary byte strings (any charset, any encoding)but obviously the performance characteristics will differ because of non-uniform distribution. But that happens even in ASCII.
Yes, you’ll get something sorted based on the bytes in the string but it won’t be lexicographically correct - for example, à will be sorted after b.
This would even break UTF-8, since multi-byte characters are a thing!
How would that break anything? The strings aren't being split.
Lexicographic encoding of UTF-8 byte sequences matches lexicographic order of the sequence of Unicode code-points. So you can sort UTF-8 strings as byte strings. Not that sorting by code-points has much meaning, but you can use the Unicode collation algorithm first.
% printf "%s\n" A B | sort
A
B
% printf "%s" A B | xxd -b -c4
00000000: 11110000 10011101 10010000 10110100 ....
00000004: 11110000 10011101 10010000 10110101 ....
% printf "%s" A B | xxd -c1 -ps | sort | xxd -r -ps | xxd -b -c4
00000000: 10010000 10010000 10011101 10011101 ....
00000004: 10110100 10110101 11110000 11110000 ....
% printf "%s" A B | xxd -c1 -ps | sort | xxd -r -ps
????????
TIL, thanks. That makes sense given how the length prefixes look like but I never thought about it. I wonder if this was by chance or if the UTF-8 creator thought about it.
Indeed, and is à less than b? Not in Unicode!
UTF-8 is downward compatible to ASCII, so anything that is just a standard character (like every character in this comment) is just a byte.
I think unicode is intrinsically antithetical to freedom. This would do great with ASCII.
The stars in the American flag aren't used. This is clearly Pride flag sort.
And Mauritius's flag has more stripes.
There's a Peter M. McIlroy as well!?!
All this time and I thought it was just Doug, I'm guessing Peter might be the son of the father of pipe?
Yes. He is a software engineer in Seattle.
This strikes me that it should parallelize pretty well if you deputise the individual stripes to multiple threads.
today i coded franzSort, faster than mergeSort, slower than quickSort
function franzSort(a) { function m(l, r) { let s = []; while (l.length && r.length) { if (l[0] <= r[0]) { s.push(l.shift()); } else { s.push(r.shift()); } } return s.concat(l).concat(r); }
}// Example usage let arr = [5, 3, 8, 6, 2, 7, 1, 4]; let sortedArr = franzSort(arr); console.log(sortedArr);
Pretty much just a variation of counting sort with a worse name? https://en.wikipedia.org/wiki/Counting_sort
No. The very article you link details the difference, and that Counting Sort is often a subroutine in Radix Sort, and that Counting Sort is extremely poorly suited to strings, which is the entire purpose of this sort. And this name is better. So... "no" in basically every way possible.
Characters in a string can be thought of as digits in a base-256 number. Call counting sort recursively on each bucket, looking at the next character in the string. Can you not see the similarity?
There is no one who fails to see the similarity, because they're both kinds of radix sorts. For instance, a counting sort is a degenerate form of radix sort that only does one pass.
But where the classic radix sort is bottom-up, the American flag sort is top-down. That really matters for data where the the most significant portions of the data is likely to be all you need to consider for the sorting. While a top-down sort will have similar performance to bottom-up in the worst case, in the best case it can skip a lot of passes, especially if the inputs are much longer than the longest common prefix.
They're related, but in a very specific way that does not meet the bar for "variation" in my opinion. As you described, counting sort is a very simple procedure that works well only on a very specific input domain. If another algorithm has "more logic" than counting sort itself in order to transform a very different input domain into a format that is suited for making many repeated calls to (a procedure that may be implemented as) counting sort, in a specific pattern, and appropriately coalescing the results of those calls, I think it is appropriate to let it have its own name. Would you prefer to refer to every possible utilization of counting as "that version of counting sort where...."?
American flag sort is essentially an in-place version of counting sort.
That's talking about the single-pass American flag sort. If you use American flag sort to sort strings, you do multiple passes: on the first pass, you sort all strings by their first character, then for each starting character, you sort the strings with that starting character (which are now in a contiguous subarray) by their second character, and so on.
@chowells calls it a “top-down radix sort” below, which is a great description. It also explains the strengths and weaknesses of the two algorithms: radix sort works great for small strings of fixed length (e.g. IPv4 addresses, which can be thought of as 4-byte sequences) while American flag sort works great for variable-length strings like actual textual strings, especially if they don't share common prefixes (e.g. dictionary words, usernames, etc.)
@hinkley pointed out that the recursive version is just a bucket sort, but in-place. Which is also true!
tl;dr:
Single-pass American flag sort = in-place counting sort
Recursive American flag sort = in-place bucket sort
A superior name.
> it is twice as fast as quicksort for large sets of string
Define large.
They did. But you'll need to read the (free) paper [1] to get the details omitted in the NITS catalog summary.
[1] http://www.usenix.org/publications/compsystems/1993/win_mcil...
[dead]
The comments made me realise there's no national flag with lots of vertical stripes. Seems like a hole in the market.
Someone alert CGP Grey!
Freedom Sort.
If you disagree, see the first amendment! If you still disagree, see the second amendment!
And if you STILL disagree, see the third amendment! If you're a soldier, in my house, I guess.
Cue the Team America theme song
That is basically what I hear in my head everytime I see that flag
For democracy!
I'm doing my part!
would you like to know more?
The 'more information' site linked on the page seems like something some HN folk love.
https://www.workmall.com/flags/united_states_flag.html