So, for a little project I needed to compress a bit... | Devine Lu Linvega

Devine Lu Linvega

So, for a little project I needed to compress a bit of data, and @cancel and I made up a spec that should be possible to implement on small systems without too many headaches.

I wrote a bit of documentation for it, and I was wondering if anyone wanted to put the docs to the test and see if they could write a toy implementation in a language of their choice. It'd help me see if there's things missing in the docs.

http://wiki.xxiivv.com/site/ulz_format

So, think of it as a little puzzle.

16 Nov 2023 at 0:46 | Open on merveilles.town

42 comments

@neauoire @cancel trying my hand at a C# version of this!

16 Nov 2023 at 1:16 | Open on merveilles.town

Devine Lu Linvega

@renaudbedard @cancel Keep me posted :>

16 Nov 2023 at 2:12 | Open on merveilles.town

@neauoire @cancel A couple of things that are tripping me up:
- CPY uses a "negative offset plus 1", but it's not clear if it's "-offset + 1" or "-(offset + 1)"
- It's implied that the dictionary pointer is always pointing to the end of the dictionary buffer after the last copy/append operation, right?

16 Nov 2023 at 2:19 | Open on merveilles.town

Devine Lu Linvega

@renaudbedard @cancel

Example: an offset of 0 means go back by 1 byte into the history.

And the pointer is at the end of the dictionary buffer, yep.

I'll add a bit about the negative increment to the offset to the docs! thanks for pointing it out

16 Nov 2023 at 2:21 | Open on merveilles.town
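
The offset convention from this exchange can be sketched in a few lines. This is only an illustration: it assumes the decoded output so far is the history being referenced, and the helper name `source_index` is hypothetical rather than anything from the ULZ docs.

```python
def source_index(output_len: int, offset: int) -> int:
    """Index of the first byte a CPY reads, given the stored offset byte.

    The stored offset is biased by one: 0 means "go back by 1 byte",
    so the distance back from the end of the output is offset + 1.
    """
    distance = offset + 1
    if distance > output_len:
        raise ValueError("offset points before the start of the output")
    return output_len - distance


history = bytearray(b"abcd")
# Offset 0 refers to the most recently written byte.
assert history[source_index(len(history), 0)] == ord("d")
# Offset 3 refers to the oldest byte in this four-byte history.
assert history[source_index(len(history), 3)] == ord("a")
```

In other words, the reading is "-(offset + 1)" relative to the end of the output, which also means a one-byte offset can reach back at most 256 bytes.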

@neauoire @cancel also I know HTML tables are a nightmare, but if you could make the LIT length box appear to take all 7 bits instead of 6 like CPY1... :cooldog:

16 Nov 2023 at 2:29 | Open on merveilles.town

Devine Lu Linvega

@renaudbedard @cancel mhm I thought I fixed that, could you reload?

16 Nov 2023 at 3:10 | Open on merveilles.town

@neauoire @cancel yep it’s good now!

16 Nov 2023 at 12:32 | Open on merveilles.town

DELETED

@neauoire @cancel one potential issue I see, which may not be an issue, is synchronisation. if my “read” alignment is 1 byte, and for whatever reason the read pointer ends up on the second byte of a cpy2 instruction (perhaps due to input file corruption), it won’t necessarily be obvious that the byte is not a lit, cpy1 or the first byte of a cpy2 instruction. it is my understanding that desyncing is a potential risk of variable byte encodings

16 Nov 2023 at 4:18 | Open on merveilles.town

Devine Lu Linvega

@zens @cancel is your issue that it's not resilient to data corruption? In the wider system implementation we have a checksum. Do you think that would be enough?
https://git.sr.ht/~rabbits/uxn-utils/tree/main/item/cli/checksum/checksum.tal

16 Nov 2023 at 4:21 | Open on merveilles.town

DELETED

@neauoire @cancel it’s more that those leading bits remind me of the leading bits of utf-8, why utf-8 was designed that way- and that if this were following the utf-8 pattern, cpy2 would have 110 as the leading bits for both of its bytes.

I am not saying you should do that, but have a look at why utf-8 did it that way. (from memory, line transmission corruption can either take out a few characters as long as sync is maintained… or the entire rest of your file if it is not)

16 Nov 2023 at 4:25 | Open on merveilles.town
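
For readers unfamiliar with the UTF-8 property mentioned here, a rough sketch: every UTF-8 continuation byte has the bit pattern `10xxxxxx`, so a reader dropped into the middle of a character can skip forward to the next boundary. The function names below are illustrative only.

```python
def is_utf8_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def resync_utf8(data: bytes, pos: int) -> int:
    """Advance pos past continuation bytes to the next character boundary."""
    while pos < len(data) and is_utf8_continuation(data[pos]):
        pos += 1
    return pos


# 1-, 2- and 3-byte sequences followed by a 1-byte one: a, e-acute, euro sign, z.
text = "a\u00e9\u20acz".encode("utf-8")
# Index 4 is the middle byte of the 3-byte euro sign; resync skips to "z".
assert text[resync_utf8(text, 4):].decode("utf-8") == "z"
```

The ULZ layout discussed in this thread has no equivalent test: length, offset and literal bytes can take any value, so a misplaced read pointer cannot tell a payload byte from a command byte.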

@neauoire @cancel here is a naive Rust implementation:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&code=fn+decode%28data%3A+%26%5Bu8%5D%29+-%3E+String+%7B%0A++++let+mut+output+%3D+vec%21%5B%5D%3B%0A++++let+mut+iter+%3D+data.iter%28%29%3B%0A++++while+let+Some%28byte%29+%3D+iter.next%28%29+%7B%0A++++++++match+byte+%26+0x80+%7B%0A++++++++++++%2F%2F+LIT%0A++++++++++++0x00+%3D%3E+%280..%28byte+%2B+1%29%29.for_each%28%7C_%7C+output.push%28*iter.next%28%29.unwrap%28%29%29%29%2C%0A++++++++++++%2F%2F+CPY%0A++++++++++++_+%3D%3E+%7B%0A++++++++++++++++let+%28length%2C+offset%29+%3D+match+byte+%26+0x40+%7B%0A++++++++++++++++++++0x00+%3D%3E+%28%0A++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%2B+4%2C%0A++++++++++++++++++++++++usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%2C%0A++++++++++++++++++++%29%2C%0A++++++++++++++++++++_+%3D%3E+%7B%0A++++++++++++++++++++++++let+length%3A+u16+%3D%0A++++++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%3C%3C+8+%7C+u16%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29%3B%0A++++++++++++++++++++++++%28length+%2B+4%2C+usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%29%0A++++++++++++++++++++%7D%0A++++++++++++++++%7D%3B%0A++++++++++++++++%280..length%29.for_each%28%7C_%7C+output.push%28output%5Boutput.len%28%29+-+offset%5D%29%29%0A++++++++++++%7D%0A++++++++%7D%0A++++%7D%0A++++std%3A%3Astr%3A%3Afrom_utf8%28%26output%29.unwrap%28%29.to_string%28%29%0A%7D%0A%0Aconst+ENCODED_DATA%3A+%26%5Bu8%5D+%3D+%26%5B%0A++++40%2C+66%2C+108%2C+117%2C+101%2C+32%2C+108%2C+105%2C+107%2C+101%2C+32%2C+109%2C+121%2C+32%2C+99%2C+111%2C+114%2C+118%2C+101%2C+116%2C%0A++++116%2C+101%2C+32%2C+105%2C+116%2C+115%2C+32%2C+105%2C+110%2C+32%2C+97%2C+110%2C+100%2C+32%2C+111%2C+117%2C+116%2C+115%2C+105%2C+100%2C%0A++++101%2C+10%2C+129%2C+40%2C+35%2C+97%2C+114%2C+101%2C+32%2C+116%2C+104%2C+101%2C+32%2C+119%2C+111%2C+114%2C+100%2C+115%2C+32%2C+73%2C+32%2C%0A++++115%2C+97%2C+121%2C+10%2C+65%2C+110%2C+100%2C+32%2C+119%2C+104%2C+97%2C+116%2C+32%2C+73%2C+32%2C+116%2C+104%2C+105%2C+110%2
C+107%2C%0A++++138%2C+41%2C+9%2C+102%2C+101%2C+101%2C+108%2C+105%2C+110%2C+103%2C+115%2C+10%2C+84%2C+128%2C+34%2C+6%2C+108%2C+105%2C+118%2C+101%2C+32%2C%0A++++105%2C+110%2C+128%2C+80%2C+23%2C+32%2C+109%2C+101%2C+10%2C+73%2C+39%2C+109%2C+32%2C+98%2C+108%2C+117%2C+101%2C+10%2C+68%2C+97%2C+32%2C%0A++++98%2C+97%2C+32%2C+100%2C+101%2C+101%2C+32%2C+100%2C+130%2C+9%2C+0%2C+105%2C+181%2C+18%2C%0A%5D%3B%0A%0Aconst+DECODED_DATA%3A+%26str+%3D+%22Blue+like+my+corvette+its+in+and+outside%0ABlue+are+the+words+I+say%0AAnd+what+I+think%0ABlue+are+the+feelings%0AThat+live+inside+me%0AI%27m+blue%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%22%3B%0A%0Afn+main%28%29+%7B%0A++++assert_eq%21%28decode%28ENCODED_DATA%29%2C+DECODED_DATA%29%3B%0A%7D%0A

16 Nov 2023 at 7:37 | Open on infosec.exchange
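
For comparison, here is a rough Python sketch of a decoder using the same interpretation as the Rust snippet in the link above. The layout in the docstring is read off that snippet rather than the official docs, and `ulz_decode` is a made-up name, so treat it as an assumption-laden illustration rather than a reference implementation.

```python
def ulz_decode(data: bytes) -> bytes:
    """Decode a ULZ stream as interpreted by the Rust snippet in this thread.

    Assumed layout:
      0lllllll                    LIT:  copy (l + 1) literal bytes from the input
      10llllll oooooooo           CPY1: copy (l + 4) bytes from (o + 1) bytes back
      11llllll llllllll oooooooo  CPY2: same as CPY1, with a 14-bit length field
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        cmd = data[i]
        i += 1
        if cmd & 0x80 == 0:
            # LIT: the next (cmd + 1) input bytes go straight to the output.
            length = cmd + 1
            if i + length > n:
                raise ValueError("truncated LIT payload")
            out += data[i:i + length]
            i += length
            continue
        if cmd & 0x40:
            # CPY2: low 6 bits of cmd are the high bits of a 14-bit length.
            if i + 2 > n:
                raise ValueError("truncated CPY2 command")
            length = ((cmd & 0x3F) << 8 | data[i]) + 4
            i += 1
        else:
            # CPY1: the length fits in the low 6 bits of cmd.
            if i + 1 > n:
                raise ValueError("truncated CPY1 command")
            length = (cmd & 0x3F) + 4
        distance = data[i] + 1  # stored offset 0 means "1 byte back"
        i += 1
        if distance > len(out):
            raise ValueError("CPY reaches before the start of the output")
        # Copy one byte at a time so the source may overlap the bytes being written.
        for _ in range(length):
            out.append(out[-distance])
    return bytes(out)


# The first few commands of the encoded example from the linked snippet.
encoded = (
    bytes([40]) + b"Blue like my corvette its in and outside\n"
    + bytes([129, 40])
    + bytes([35]) + b"are the words I say\nAnd what I think"
    + bytes([138, 41])
)
assert ulz_decode(encoded) == (
    b"Blue like my corvette its in and outside\n"
    b"Blue are the words I say\nAnd what I think\nBlue are the "
)
```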

[DATA EXPUNGED]

WimⓂ️

@neauoire This is my interpretation, is it correct?

0 LIT:7 <up to 2^7-1 bytes which are not commands>
10 CPY1:6 < copy up to 2^6-1 bytes from offset; offset is a byte >
11 CPY2:14 < copy up to 2^14-1 bytes from offset; offset is a byte >

16 Nov 2023 at 10:01 | Open on merveilles.town

Devine Lu Linvega

@wim_v12e @cancel yup that looks right

16 Nov 2023 at 15:57 | Open on merveilles.town
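
The bit layout confirmed here can be expressed as a small command-header parser. This is a sketch only: `parse_command` is a hypothetical helper, and it returns the raw length field without any bias (the Rust snippet earlier in the thread adds 1 to LIT lengths and 4 to CPY lengths).

```python
from typing import Tuple


def parse_command(data: bytes, pos: int) -> Tuple[str, int, int]:
    """Return (kind, raw_length_field, header_size) for the command at pos.

    header_size counts only the bytes holding the tag bits and length field;
    the offset byte of a CPY and the payload of a LIT follow it.
    """
    first = data[pos]
    if first & 0x80 == 0:
        # Leading bit 0: LIT with a 7-bit length field.
        return ("LIT", first & 0x7F, 1)
    if first & 0x40 == 0:
        # Leading bits 10: CPY1 with a 6-bit length field.
        return ("CPY1", first & 0x3F, 1)
    # Leading bits 11: CPY2 with a 14-bit length field spread over two bytes.
    return ("CPY2", (first & 0x3F) << 8 | data[pos + 1], 2)


assert parse_command(bytes([40]), 0) == ("LIT", 40, 1)
assert parse_command(bytes([129, 40]), 0) == ("CPY1", 1, 1)
assert parse_command(bytes([0xC1, 0x02, 0x00]), 0) == ("CPY2", 258, 2)
```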

max22-

@neauoire @cancel I've made a little implementation in Go : https://github.com/max22-/ulz-go
it was a little bit difficult to understand that you can copy data even if the length goes past the end of the output buffer (i did an equivalent of memcpy, and it didn't work ^^)

16 Nov 2023 at 20:48 | Open on mastodon.xyz
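
The pitfall described here can be shown with two small copy routines. Both function names are invented for this sketch, and `distance` stands for the number of bytes to step back from the end of the output.

```python
def copy_bytewise(out: bytearray, distance: int, length: int) -> None:
    """Append length bytes, each read distance bytes back from the current end.

    Because the end moves as bytes are appended, the copy can read bytes
    that it wrote itself a moment earlier.
    """
    for _ in range(length):
        out.append(out[-distance])


def copy_slice(out: bytearray, distance: int, length: int) -> None:
    """memcpy-style copy: take one slice of the existing bytes, then append it."""
    start = len(out) - distance
    out += out[start:start + length]


a = bytearray(b"ab")
copy_bytewise(a, 2, 6)
assert a == b"abababab"  # the two-byte pattern repeats to fill all 6 bytes

b = bytearray(b"ab")
copy_slice(b, 2, 6)
assert b == b"abab"  # only 2 bytes existed when the slice was taken
```

When the length exceeds the distance, the byte-by-byte version repeats the last `distance` bytes as a pattern, which is what lets a short CPY command expand into a long run.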

Devine Lu Linvega

@maxime_andre @cancel ah! that trips many people, how would you explain this so others aren't tripped by it?

17 Nov 2023 at 1:25 | Open on merveilles.town

max22-

@neauoire @cancel 🤔 maybe a little drawing ?

17 Nov 2023 at 8:24 | Open on mastodon.xyz

Devine Lu Linvega

Thanks to everyone who answered my puzzle and wrote experimental implementations of our little LZ scheme! You've helped us improve the documentation and see how portable of an algorithm it is across multiple languages!

17 Nov 2023 at 3:24 | Open on merveilles.town

Nico

@neauoire @cancel Nice! But the encoded data doesn't use CPY2 does it? I expected the example to guide me towards a full implementation, so I was surprised to see the full text after only LIT and CPY1! :)

17 Nov 2023 at 18:53 | Open on social.sdf.org

Devine Lu Linvega

@nicolagi @cancel ah you're right, I chose a segment that's too small. I'll fix the example :) thanks for pointing that out.

17 Nov 2023 at 19:12 | Open on merveilles.town

Nico

@neauoire No need to update the example! My message was a way of sharing my enthusiasm for working through your puzzle. Just showing appreciation and checking I didn't miss something; not intending to criticize or create work... sorry if it came across differently!

19 Nov 2023 at 9:59 | Open on social.sdf.org

Devine Lu Linvega

@nicolagi nono it's all good :) I've since added a longer example to the repo (so as not to clog the documentation). It's good, I wanted to have a way to benchmark CPY2 as well!

19 Nov 2023 at 15:44 | Open on merveilles.town

Verwechslungsgefährte 🍿

@neauoire @cancel Are there 2 lines of "Da ba dee da ba di" too many in your deflated example?

19 Nov 2023 at 12:01 | Open on dresden.network

Devine Lu Linvega

@dichotomiker @cancel no, there's supposed to be 4, are you only getting two?

19 Nov 2023 at 17:44 | Open on merveilles.town

Verwechslungsgefährte 🍿

@neauoire @cancel Ok, then I'm in painful trouble with my indices here.

19 Nov 2023 at 17:53 | Open on dresden.network

Devine Lu Linvega

@dichotomiker @cancel Here's the encoder and decoder so you can try with an example of your own to see where it fails.
https://git.sr.ht/~rabbits/uxn-utils/tree/main/item/cli/lz/ulzenc.c
https://git.sr.ht/~rabbits/uxn-utils/tree/main/item/cli/lz/ulzdec.c

19 Nov 2023 at 17:54 | Open on merveilles.town

cancel

@neauoire @dichotomiker you should put a note that the encoder in uxn-utils is unsafe and the one from Uxn32 should be used for real software

19 Nov 2023 at 18:04 | Open on merveilles.town

Devine Lu Linvega

@cancel @dichotomiker sure, would you like to make a standalone file in the repo that I can link people to?

19 Nov 2023 at 18:05 | Open on merveilles.town

cancel

@neauoire @dichotomiker uxn_lz.c and uxn_lz.h have no dependencies

19 Nov 2023 at 18:05 | Open on merveilles.town

Devine Lu Linvega

@cancel I mean with build instructions and main(), so people can build it without too much fussing with it as a library.

19 Nov 2023 at 18:06 | Open on merveilles.town

cancel

@neauoire hmm. why not just copy uxn_lz.c and uxn_lz.h to your uxn-utils and use them?

19 Nov 2023 at 18:07 | Open on merveilles.town

Devine Lu Linvega

@cancel sure I can do that, I'll set it up today

19 Nov 2023 at 18:07 | Open on merveilles.town

cancel replied to Devine Lu Linvega

@neauoire nice :)

The reason I haven't made an example standalone program in C for it in the Uxn32 repo is that none of the Uxn32 code currently uses the C standard library, and doing a cross-platform cmd line program that does file operations would require suddenly switching to using it. Not the end of the world but… well maybe I'll do it eventually.

19 Nov 2023 at 18:08 | Open on merveilles.town

Devine Lu Linvega replied to cancel

@cancel ah! I didn't know that. Good to know, yeah I'll bring the ref implementation and make a README for it!

19 Nov 2023 at 18:09 | Open on merveilles.town

cancel replied to Devine Lu Linvega

@neauoire sorry, I didn't document how to use the streaming decompressor yet. (Having to swap hard drives to work on Uxn32 is turning out to be pretty inconvenient...)

19 Nov 2023 at 18:10 | Open on merveilles.town

Verwechslungsgefährte 🍿

@neauoire @cancel Thanks. I got it working now.
https://codeberg.org/spazzpp2/julz/src/branch/main
(Never do trial and error on index offsets in the middle of the night. Also, know your libraries).

I mostly used the table and the C implementation from http://wiki.xxiivv.com/site/ulz_format for reference.

I was a little disappointed when I learned that CPY doesn't copy from the input but from the output array. So, no self-modification. Good for debugging, though.

Would you get more out of it if you assumed that the compression always starts with LIT and then simply alternates between CPY and LIT? Their first bit would then be redundant (freeing it up for twice as much length). You'd probably need a zero-length LIT as well.

CPY with length > offset repeating over and over again is a really nice idea, especially for ordered dither images!

20 Nov 2023 at 4:36 | Open on dresden.network

@reiver ⊼ (Charles) :batman:

@neauoire @cancel

Have you considered including 'magic-bytes' with your ULZ file format?

...

The usage of it would be —

If someone doesn't know what the file-name is, they could still determine the type of the file.

...

It could be something as simple as the first 7 bytes of the file format being:

55 4C 5A 2F 31 0D 0A

Which, if interpreted as ASCII or Unicode UTF-8 would be:

"ULZ/1\r\n"

...

Here are magic bytes for other file formats:

https://en.wikipedia.org/wiki/List_of_file_signatures

.

19 Nov 2023 at 13:10 | Open on mastodon.social
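
A check for the signature proposed in this post might look like the sketch below. The seven bytes are only this commenter's suggestion, not part of the ULZ format, and the names `ULZ_MAGIC` and `has_ulz_magic` are hypothetical.

```python
# Proposed (not official) signature: "ULZ/1" followed by CR LF.
ULZ_MAGIC = bytes.fromhex("55 4C 5A 2F 31 0D 0A")


def has_ulz_magic(data: bytes) -> bool:
    """Return True if data begins with the proposed 7-byte signature."""
    return data.startswith(ULZ_MAGIC)


# The hex bytes spell out the ASCII text given in the post.
assert ULZ_MAGIC == b"ULZ/1\r\n"
assert has_ulz_magic(ULZ_MAGIC + bytes([0, 65]))
assert not has_ulz_magic(b"ULZ/1\n")
```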

Devine Lu Linvega

@reiver @cancel it will be used as parts of other formats, which might have identifiers yes :)

19 Nov 2023 at 15:46 | Open on merveilles.town

@neauoire @cancel really enjoyed this! when a video of it popped up on YouTube I read the wiki

24 Nov 2023 at 2:20 | Open on mastodon.gamedev.place