BirAdam 2 days ago

This is excellent work, and I love seeing projects in Ada.

nayuki 3 days ago

> Another surprising feature of the BWT is that for reversing the permutation, no extra information is needed!

I think this is not true. The way I learned the BWT, after encoding, you need to store the index of the first character (which is a tiny bit of extra information). https://web.archive.org/web/20170325024404/http://marknelson...

  • amy_petrik 3 days ago

    well here's a big surprise of BWT

    BWT is a transformation transforming a string into "suffix tree space"

    You can do searches - very fast searches - in BZ2/BWT compression space

    Discussed in the '90s in the legendary dr. dobbs

    But more significantly, BWT is the basis for most DNA and RNA sequence aligners nowadays - which perform string search/matching in suffix tree (transformed) space. It's like doing frequency stuff in fourier domain, it's like adding is multiplication in logspace - but in weird data string search space. The BWT.

  • horizion2025 3 days ago

    I also think when looking at how BWT works, it is initially surprising it is invertible even with such little information as the first character. It appears a bit like sorting the letters and that certainly would require a lot more information to back into place. It is a bit magic even when you know the inversion algorithm.

  • vintermann 3 days ago

    There is a fully bijective version of the BWT which doesn't require the index. It's not what BZip2 uses though.

    • Someone 3 days ago

      That method adds an extra character to your alphabet, and uses that as a marker in the transformed string to h effectively) store the index of the first character inside the transformed text.

      If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.

      • vintermann 3 days ago

        No, I'm not talking about that. I'm talking about David Scott's bijective BWT variant. It sorts symbols by a slightly different type of context, and so can do away with a sentinel symbol or a stored index entirely. It was described in the wikipedia page last I checked.

        Scott was a bit of a weirdo, but his algorithm actually got recognition in academia, there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.

  • etrez 3 days ago

    Thx, corrected.