Author Topic: Slow compression issue (and a crash)

saga

  • Posts: 2763
Slow compression issue (and a crash)
« on: 14 Sep '20 - 18:57 »
Since the MO3 format's LZ compression theoretically allows the window to grow infinitely large, I was wondering about the largest window size the official MO3 encoder actually uses. So I created an IT file with a very long and repetitive song message (~22K space characters), and it seems like that upset the encoder. :) I have attached two files: crash.it causes a crash, apparently because there are no samples and the pattern is completely empty. The second file, hang.it, doesn't crash but probably hits a very unoptimized code path in the encoder: it took about 15 to 20 minutes to process the file.

Anyway, on the original topic I was researching: For the sample compression algorithm in MO3, a clear upper bound on the decompressed size can be established, because every sampling point is represented by at least two bits, so the decompressed data can be at most 4 times larger than the compressed data. For the music data it's more complicated, and I think the decompressed size is essentially unbounded.
Given the behaviour I observed with hang.it, I suppose the official MO3 encoder has no upper bound on string lengths and just keeps growing its window if the input data allows for it? The reason I am wondering is to reduce the chance of denial-of-service attacks when loading MO3 files - right now, for example, the file header could claim that the decompressed size of the music blob is a gigabyte while the file itself is only a kilobyte in size (and as hang.it demonstrates, this could even be a syntactically valid MO3 file, it doesn't have to be malformed garbage). I have eliminated that risk for sample data as described above, but for the music data I don't think this is possible by just looking at the algorithm. I could of course assume that the variable that receives the LZ string length and offset is a 32-bit integer, but that doesn't really establish a helpful upper bound.
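For illustration, the sample-data check I describe above could look roughly like this (the function name and signature are mine, not taken from any real decoder):

    // Hypothetical sketch: since every sampling point takes at least
    // 2 bits in the compressed stream, the claimed decompressed size
    // can be validated against the compressed block size before
    // allocating anything (a factor of 4 for 8-bit samples).
    #include <cstdint>

    bool PlausibleSampleSize(uint64_t compressedBytes, uint64_t claimedBytes, uint32_t bytesPerPoint)
    {
        const uint64_t maxPoints = compressedBytes * 8 / 2; // >= 2 bits per point
        return claimedBytes <= maxPoints * bytesPerPoint;
    }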
« Last Edit: 14 Sep '20 - 21:49 by saga »

Ian @ un4seen

  • Administrator
  • Posts: 26137
Re: Slow compression issue (and a crash)
« Reply #1 on: 17 Sep '20 - 17:13 »
Since the MO3 format's LZ compression theoretically allows the window to grow infinitely large, I was wondering about the largest window size the official MO3 encoder actually uses. So I created an IT file with a very long and repetitive song message (~22K space characters), and it seems like that upset the encoder. :) I have attached two files: crash.it causes a crash, apparently because there are no samples and the pattern is completely empty. The second file, hang.it, doesn't crash but probably hits a very unoptimized code path in the encoder: it took about 15 to 20 minutes to process the file.

Oops! The MO3 2.4.2 encoder is indeed crashing when there's no sample data. Here's an update that should fix that, and the hanging problem too:

   www.un4seen.com/stuff/mo3.exe

Let me know if you still see any problems with it.

Given the behaviour I observed with hang.it, I suppose the official MO3 encoder has no upper bound on string lengths and just keeps growing its window if the input data allows for it? The reason I am wondering is to reduce the chance of denial-of-service attacks when loading MO3 files - right now, for example, the file header could claim that the decompressed size of the music blob is a gigabyte while the file itself is only a kilobyte in size (and as hang.it demonstrates, this could even be a syntactically valid MO3 file, it doesn't have to be malformed garbage).

Yes, matches could in theory be any length. It's probably safe to assume that the data will never legitimately be anywhere near 1GB though :) ... It might be possible to calculate a ballpark limit from the size of the MO3 structures and the maximum number of them in a file.

By "denial-of-service attacks", I guess you mean using up memory? The memory should only be in use briefly anyway, as an app would usually release it straight after parsing the decoded data (it isn't needed any more after that).

saga

  • Posts: 2763
Re: Slow compression issue (and a crash)
« Reply #2 on: 17 Sep '20 - 17:48 »
I can confirm that the new version no longer crashes / hangs. :)

By "denial-of-service attacks", I guess you mean using up memory? The memory should only be in use briefly anyway, as an app would usually release it straight after parsing the decoded data (it isn't needed any more after that).
It's true that it's only used for a short while, and in a 32-bit process the allocation might even simply fail. Depending on the scenario the code is used in, though, even briefly allocating a lot of memory can be troublesome (think of a server scanning module files uploaded by users). I just want to limit that risk as much as possible in libopenmpt - for most formats, decoding a file requires only slightly more memory than the file itself, but compressed formats are of course more dangerous in that regard.
For MO3 I only see two possible (not-really-)solutions: I suppose for most real-world files one could assume that the decompressed data will not be, say, 1000 times bigger than the compressed file, so a 1KB file would not cause more than 1MB of memory to be allocated. The other alternative would only help with rejecting corrupted files: try to decode the music data without actually storing the decompressed output, and if the resulting size is not what the header claimed, reject the file. The disadvantage there is of course that decoding takes longer for all "real" MO3 files, so I don't like that approach a lot. I guess this is a dead end for now. :)
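A rough sketch of the first idea - note that the 1000x ratio and the minimum allowance are pure guesses on my part, not derived from the format:

    // Hypothetical ratio-based cap: assume real-world MO3 music data
    // never expands by more than ~1000x, with a small fixed allowance
    // so that tiny files aren't rejected outright.
    #include <cstdint>
    #include <algorithm>

    uint64_t MaxDecompressedMusicSize(uint64_t fileSize)
    {
        constexpr uint64_t kMaxRatio = 1000;        // assumed worst-case expansion
        constexpr uint64_t kMinAllowance = 1 << 16; // 64KB floor, also an assumption
        return std::max(fileSize * kMaxRatio, kMinAllowance);
    }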

saga

  • Posts: 2763
Re: Slow compression issue (and a crash)
« Reply #3 on: 17 Apr '24 - 19:14 »
Picking up this topic again... recently I have changed my MO3 decoder to decompress data on-the-fly. That way, even if the music data is claimed to be very large, only as much data as is actually required to read the MO3 structures will be decompressed. That makes it a bit harder to force the application to waste memory, but there are still a few ways to do that, and I think they could be mitigated by defining limits for certain problematic structures (a sketch of these limits follows the list):

- Track data length: This one is easy. It's encoded as a uint32 value, but we can actually establish an upper limit of about 2MB per track (65535 rows containing at most 15 events each, that's 2,031,585 bytes). Hence, this would only need to be verified in the decoder; the encoder cannot produce a track that is too big.
- String lengths: This one is a bit more tricky. A one-gigabyte string containing just "AAAAAAAAAAAAAAAAAAAAAAAAAAA...." with no null byte anywhere can be encoded in just a few bytes. Since the song name is the first field in the music data, you cannot even verify whether any of the other header structures make sense without decompressing that string. Maybe we can say that any song, instrument or sample name longer than 1024 characters should be considered invalid (that's probably long enough to cover any sensible format extensions), and that the song message should not be longer than 1024 * 1024 characters. We could also increase those limits to 8KB and 8MB respectively to be really future-proof. ;)
- Plugin chunks: This one is tricky because some plugins out there save their chunks in a really wasteful way; chunk sizes beyond 1MB are not unheard of. But I would guess that, as with instrument names, a limit like 64MB would be well beyond any in-the-wild chunk sizes, and it would still prevent some wasteful memory allocations (though probably not enough, as 250 plugins with 64MB chunks each would still exceed the physical maximum size of the uncompressed music chunk anyway).
- The new chunk extensions at the end of the file: As there can be an arbitrary number of them, this one is again very tricky. Apart from establishing a similar limit as for plugin chunks, I'm not sure what to do here.
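To make that concrete, the proposed limits could be written down like this - every value is just my suggestion from this post, nothing here is part of the official MO3 spec:

    // Hypothetical validation limits as proposed above.
    #include <cstdint>

    constexpr uint32_t kMaxTrackDataSize   = 2031585;          // 65535 rows, at most 15 events each
    constexpr uint32_t kMaxNameLength      = 1024;             // song/instrument/sample names
    constexpr uint32_t kMaxMessageLength   = 1024 * 1024;      // song message
    constexpr uint32_t kMaxPluginChunkSize = 64 * 1024 * 1024; // could also apply to chunk extensions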

What do you think? Would adding any of those limitations to the MO3 encoder make sense? Obviously some of those limits are not really applicable to the MO3 encoder right now (none of the file formats it supports have song titles that long, for instance) - but I think it would be good to have a common definition of what is and isn't considered a valid MO3 file.

In general, limiting the size of the LZ window is probably also not going to work when existing real-world modules have to be accounted for - a match reference costs only a few bytes to encode regardless of the window size, so even if the back window were limited to 16K, a file smaller than 1MB could still produce a 2GB output, I think. So disallowing some of those music data structures from even growing that big might be the better approach.
« Last Edit: 18 Apr '24 - 09:04 by saga »

Ian @ un4seen

  • Administrator
  • Posts: 26137
Re: Slow compression issue (and a crash)
« Reply #4 on: 18 Apr '24 - 16:22 »
I'm not sure I like the idea of imposing limits unnecessarily, in case they need to be exceeded in future. But when you know that certain things won't exceed a certain size in the file's original format then it should be fine for you to assume the same in the MO3 file. For example, the extra chunks you mention are only in OpenMPT-created files, so if OpenMPT imposes limits on them then you could assume the same in MO3 files. Similarly for strings, you could assume that they won't exceed the limits that the original format has. For the song name and message strings (which come before the original format info), you could use the largest amongst all currently supported formats, which I think is 28 and 65535 respectively (unless OpenMPT allows longer in "text" chunks?).

Do you have any particular dodgy MO3 files causing concern? It'd be interesting to see them.

saga

  • Posts: 2763
Re: Slow compression issue (and a crash)
« Reply #5 on: 28 Apr '24 - 17:32 »
Similarly for strings, you could assume that they won't exceed the limits that the original format has. For the song name and message strings (which come before the original format info), you could use the largest amongst all currently supported formats, which I think is 28 and 65535 respectively (unless OpenMPT allows longer in "text" chunks?).
OpenMPT pretty much allows the song message to be any size. It's similar with the OpenMPT extensions; plugin chunks in particular could be any size.
Establishing limits for specific file types is unfortunately not going to address the problem at its heart - what I want to avoid are all the possible worst-case scenarios where a disproportionate amount of memory is consumed for syntactically valid but still nonsensical files. Unfortunately I think the song message and OpenMPT extensions in particular won't allow for establishing any practical safe limits - hence I decided that for now, I will not accept any MO3 files with more than 512MB of uncompressed music data (i.e. ignoring all sample data). Practically speaking, such a big music data chunk could only be achieved with plugins embedding huge data chunks in the file, but I don't think that's very realistic (I tried this with a plugin whose data chunk is a bit less than 1MB in size, and I couldn't load more than 80 plugin instances before things started falling apart... :) ).

Do you have any particular dodgy MO3 files causing concern? It'd be interesting to see them.
I don't have any specific examples of dodgy files that generate a "valid" MO3 music data structure, but when fuzzing OpenMPT's file loaders I regularly run into generated MO3 files that would cause hundreds of MB of data to be decompressed. A common case is the repetition of a single character a billion times or so, causing several GB of memory to be allocated (if not restricted, like in my updated code) from just a few bytes of input. It's not a particularly "exciting" example, but I have attached an MO3 file generated through fuzzing which causes XMPlay to temporarily allocate more than 800MB of memory.

Ian @ un4seen

  • Administrator
  • Posts: 26137
Re: Slow compression issue (and a crash)
« Reply #6 on: 1 May '24 - 14:17 »
I decided that for now, I will not accept any MO3 files with more than 512MB of uncompressed music data (i.e. ignoring all sample data). Practically speaking, such a big music data chunk could only be achieved with plugins embedding huge data chunks in the file, but I don't think that's very realistic (I tried this with a plugin whose data chunk is a bit less than 1MB in size, and I couldn't load more than 80 plugin instances before things started falling apart... :) ).

Seems a safe enough assumption :)

I don't have any specific examples of dodgy files that generate a "valid" MO3 music data structure, but when fuzzing OpenMPT's file loaders I regularly run into generated MO3 files that would cause hundreds of MB of data to be decompressed. A common case is the repetition of a single character a billion times or so, causing several GB of memory to be allocated (if not restricted, like in my updated code) from just a few bytes of input. It's not a particularly "exciting" example, but I have attached an MO3 file generated through fuzzing which causes XMPlay to temporarily allocate more than 800MB of memory.

The MO3 header's decompressed length value determines how much memory is allocated while loading, and that value is invalid in this file, so decompression fails (the decompressed output should match the header value). So a way to avoid the dodgy large memory allocation would be to first try decoding the data without writing it to memory, and then only decode for real if that succeeds.
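Something like this rough two-pass sketch - DecodeMusicData here is just a stand-in of my own, not the real decompressor:

    // Hypothetical two-pass decode: pass 1 runs the decompressor in a
    // counting mode that discards output; the real buffer is only
    // allocated if the counted size matches the header's claim.
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <vector>

    enum class Mode { CountOnly, Store };

    // Trivial stand-in for the real MO3 music-data decompressor; it just
    // copies the input, which is enough to demonstrate the control flow.
    uint64_t DecodeMusicData(const uint8_t *in, size_t inSize, Mode mode, uint8_t *out)
    {
        if(mode == Mode::Store)
            std::memcpy(out, in, inSize);
        return inSize; // bytes of output produced
    }

    std::optional<std::vector<uint8_t>> SafeDecode(const uint8_t *in, size_t inSize, uint64_t claimedSize)
    {
        // pass 1: decode without storing anything, just count the output
        if(DecodeMusicData(in, inSize, Mode::CountOnly, nullptr) != claimedSize)
            return std::nullopt; // header lies about the size: reject the file
        // pass 2: the size checks out, allocate and decode for real
        std::vector<uint8_t> buffer(static_cast<size_t>(claimedSize));
        DecodeMusicData(in, inSize, Mode::Store, buffer.data());
        return buffer;
    }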

saga

  • Posts: 2763
Re: Slow compression issue (and a crash)
« Reply #7 on: 3 May '24 - 19:00 »
The MO3 header's decompressed length value determines how much memory is allocated while loading, and that value is invalid in this file, so decompression fails (the decompressed output should match the header value). So a way to avoid the dodgy large memory allocation would be to first try decoding the data without writing it to memory, and then only decode for real if that succeeds.
I decided against doing so since it pessimizes decompression times for valid modules, and it doesn't really fix the problem completely anyway - one can still create a valid compression stream with the same properties (e.g. by claiming that plugin chunks are really big, as mentioned before). So I think imposing an upper limit on the decompressed size is the only sensible thing any decoder can do for now.