I'd like to highlight the Introduction of nmcp.pdf which explains in a very compressed form how it works which encompass all the magic.
"The lossless data compressor employs the traditional predictive approach: at each time t, the encoder uses the neural network model to compute the probability vector p of the next symbol values t knowing all the preceding symbols s0 up to st−1. The actual symbol value st is encoded using an arithmetic encoder with approximately −log2 ( pst ) bits. Then the model is updated knowing the symbol st. The decoder works symmetrically so there is no need to transmit the model parameters. It implies both encoder and decoder update their model identically.
When no preprocessing is done, st represents the byte at position t. Hence there are Ns= 256 different symbol values from 0 to Ns−1."
For those usually familiar with Huffman coding this is using an Arithmetic coding https://en.wikipedia.org/wiki/Arithmetic_coding
It allows adaptive coding (i.e. changing the probabilities of symbols dynamically). Here these probabilities are modelled dynamically using a neural network.
The magic is that the neural network parameters are defined implicitly : You don't need to transmit the neural network parameters, therefore you can use as big as you want neural network.
During the decoding the from scratch network is continuously trained using the freshly decoded data, in the same way that it was during the encoding. The more data you compress, the more you train the internal neural network parameters, and the better the prediction for next character gets.
You decode 20 bytes, you update the model with this new data, you decode 20 new bytes with the new model, you update the model,... , every 1000000 bytes you can even update your model multiple times using all currently available data to make the network converge faster.
Of course this only works if everything is exactly deterministic, which Fabrice Bellard took great effort in guaranteeing, which is no small engineering feat.
I was just working on a similar project and didn't think you could do decode without sending weights, amazing!
The concept is rather intriguing in terms of having a compression algorithm that evolves towards an ideal optimization asymptote. I, for one, would be annoyed at the thought that the compression of an identical artifact might result in a different compression size output as it would seed doubt as to whether they actually had the same input.
As long as the initial state of the network is fixed, the compressed output will be identical every time.
What is the probability that there are no collisions by the encoder?
100%.
What do you mean by collisions?
As described, arithmetic coding does a 1:1 mapping of input-output, while NN deterministically generates probability distribution for the arithmetic coding. Conceptually, it would work with any deterministic / reproducible function, even a PRF or a running hash, though the point of NN is to give good distribution estimates for interesting inputs, thus short output from the arithmetic encoder.
Presumably, there are some limited choices during coding, which then need to be encoded as well.
Is the model actually trained (=updating the weights instead of just the state) during en/decoding?
Yes that's the trick, the model is trained both in the encoding and the decoding from the data that is encoded/decoded, continuously as data flows, from the same scratch in both cases and then trained with exactly the same data (the raw data).
Because of training is deterministic and the same data is presented during encoding and decoding, the weights of encoder and decoder are always kept in sync, such that the network used when encoding the nth byte is exactly the same as the network used when decoding the nth byte.
The weights of the network are the internal state.