mhoye

Holy _shit_ this paper, and the insight behind it.

You know how every receiver is also a transmitter, _well_: every text predictor is also a text compressor, and vice-versa.

You can outperform massive neural networks running millions of parameters, with a few lines of python and a novel application of _gzip_.

aclanthology.org/2023.findings
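
(For anyone who wants to see the shape of the idea: the sketch below is a rough reconstruction, not the authors' code. It compresses texts with gzip, uses only the resulting byte lengths to compute a normalized compression distance, and classifies by nearest neighbours against a labelled reference set. The toy dataset, the space-separated concatenation, and the k value are all made up for illustration.)

```python
import gzip

def clen(s: str) -> int:
    # length in bytes of the gzip-compressed text
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # normalized compression distance, computed from byte lengths only
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, labelled: list[tuple[str, str]], k: int = 3) -> str:
    # k-nearest-neighbour vote over NCD to every labelled reference example
    neighbours = sorted(labelled, key=lambda pair: ncd(text, pair[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# toy usage with made-up examples
training = [
    ("the striker scored a late winner in the cup final", "sport"),
    ("the midfielder was transferred for a record fee", "sport"),
    ("the central bank raised interest rates again", "finance"),
    ("markets fell after the quarterly earnings report", "finance"),
]
print(classify("shares dropped when the rate decision was announced", training))
```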

bmaxv

@mhoye I will have to look at this.

I always felt I was just a smidge not smart enough to really dive in and get machine learning, in a "let me implement this myself" way.

But I get nearest neighbors and I should be able to understand zipping.

ranjit

@darius @aparrish i don't really understand the implications of this paper, but i'm hoping it means that gzip is sentient

mhoye

If this is reliable, this is "take something that needed a datacenter last year and do it on a phone this year" material.

XaiaX

@mhoye I don’t know, seems like you would need the storage space to store all the references you’re comparing to, even if the computation is easy. Sounds like a time/space trade off.

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@XaiaX @mhoye but that's almost already the case, existing models are pretty large. I'd worry more about the computational complexity / representational examples for classification against: O(|to_classify| * |reference_dataset| * maxlen(to_classify ⊕ reference_dataset))

mhoye

@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror are plenty?

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX > spicy autocomplete

huh, nah the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

probably yes. (imo the coolest thing would be producing a tool that could both classify and predict using the same infrastructure (lightly preprocessed large text dumps, <10GiB); that would be a massive improvement)

mhoye

@fogti @XaiaX I'm a bit more interested in predictive text tools that give you stylistic nudges towards artists you admire, and finding a way to get artists paid for that. Smart autocomplete/smart fill tools that answer, "what might Degas have done right here?"

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX interesting thing is that it is probably *much* easier to simultaneously get the information for text completion *and* also what authors were involved in that match (instead of a large pool of authors just the relevant subset). [in the case of these compression-decompression + reference dataset models]

Matt Stratford

@mhoye @fogti @XaiaX

Idle thought that LaTeX equations rendering properly would be a pretty good use case for a Mastodon client, given the overall geeky clientele on here! cc. @ivory

Dr. Robert M Flight

@mhoye Wait, what? So their claim is that anything that is decent at compression should also be decent at prediction? So therefore we've missed the boat, because all of the work on compression over the past few decades means we have really good predictors?

Definitely going to be reading this.

Gabriele Svelto

@rmflight @mhoye symbol prediction is part of many of the best lossless compression algorithms, isn't it?

mhoye

@gabrielesvelto @rmflight It is, and if this paper's results hold up, we're talking about how large-scale deep-learning networks are fundamentally a technical dead end, and how something that takes a datacenter to do with a DNN can be done better with a clever application of gzip on a phone.
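
(A toy illustration of that prediction-compression link, not taken from the paper: if a predictor assigns probability p to the next character, an ideal entropy coder spends about -log2(p) bits on it, so a better predictor means fewer total bits. The simple character model and repeated sample text below are made up for the example; gzip's size is printed alongside purely for comparison.)

```python
import gzip
import math
from collections import Counter, defaultdict

def ideal_bits(text: str, order: int = 1) -> float:
    # bits an ideal entropy coder would need when driven by a simple
    # adaptive order-N character model with add-one smoothing
    counts = defaultdict(Counter)
    alphabet = set(text)
    total = 0.0
    for i, ch in enumerate(text):
        seen = counts[text[max(0, i - order):i]]
        p = (seen[ch] + 1) / (sum(seen.values()) + len(alphabet))
        total += -math.log2(p)   # cost of coding ch at probability p
        seen[ch] += 1            # the model learns as it goes
    return total

text = "the cat sat on the mat. " * 40
print(ideal_bits(text, order=0))  # weak predictor: lots of bits
print(ideal_bits(text, order=2))  # better predictor: far fewer bits
print(len(gzip.compress(text.encode("utf-8"))) * 8)  # gzip's size in bits, for comparison
```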

Nick Wood

@mhoye @gabrielesvelto @rmflight So what you’re saying is that I’m about to get a screaming deal on a graphics card?

Choong Ng

@mhoye @gabrielesvelto @rmflight On first look I think what this paper suggests is 1) for some classification tasks there's a nicely simple approach that works well, and 2) this is a promising path towards better feature engineering for language models, which will in turn result in better accuracy vs cost.

Choong Ng

@mhoye @gabrielesvelto @rmflight If this works out well we'll see better + smaller models for all tasks (not just classification) that outperform both current DNNs and the NCD technique they use, at moderate cost. There's precedent for this being a successful approach, for example using frequency-domain data for audio models instead of raw PCM. There's also precedent for finding ways DNNs waste a lot of capacity on effectively routing data around, and restructuring to fix that (ResNets, for example).

Choong Ng

@mhoye @gabrielesvelto @rmflight Overall, though, in recent history data-based approaches have tended to win, so I would expect the useful bits to get incorporated into DNNs rather than DNNs being obsoleted in almost any context. My favorite essay on that topic, by Rich Sutton: incompleteideas.net/IncIdeas/B

Paul M. Heider

@rmflight @mhoye Sounds about right. My old coworker Tom always joked that NLP was just a compression algorithm on the input text.

mhoye

@paulmheider @rmflight And Ted Chiang has referred to ML-generated text as "a jpeg of a language", yeah. But to see that come together in fifteen lines of python that out-do these massive, crazy expensive DNN models is bonkers, jaws on the floor material.

Severák

@mhoye what is it for? for detecting the topic of a text without needing the costly training of a neural network?

Daneel Adrian Cayce

@mhoye holy fuck this is potentially brilliant, and I am so excited to dive into reading this after work

mhoye

@sysop "Code is available at $URL" right there in the abstract! Holy shit!

AK

@mhoye You had me at "novel application of gzip"

matzipan

@mhoye @jrconlin the abstract reads like a self-aware joke. Please tell me it's a self-aware joke

bob

@mhoye this is text classification not text prediction

Emily S

@mhoye wait it works on a dataset in Pinyin?! Damn

And it's 14 lines of python

And it's not even looking at the contents of the compression, it's just using the byte lengths? Holy fucking shit that's smart as hell.

😍

Tatjana Scheffler

@mhoye compression methods have been used in text classification for authorship analysis for quite a while.

XaiaX

@tschfflr @mhoye yes, I remember discussion of this at least a decade ago and probably longer.

Still interesting.

XaiaX

@tschfflr @mhoye after looking through it, it seems like they acknowledge all that as the basis for this work.

Albert Cardona

@mhoye Compression is ... very interesting. Recalling here Schmidhuber's take on compression and its predictability being at the root of learning, beauty, novelty, interestingness, boredom ... arxiv.org/pdf/0812.4360

Previously: mathstodon.xyz/@albertcardona/

allison

@mhoye reading this paper and cackling the whole time, it's just so clever

Ben Zanin

@mhoye (this is not particularly novel; I remember reading a similar paper about using gzip to de-anonymize written works back around when I was learning about singular value decomposition for vector-space document searching from Maciej Ceglowski, I think back in the early 2000s. But it definitely is novel that this old technique is now outperforming current tech darlings.)

Ben Zanin

@mhoye that prediction ≈ compression is also the basis of the Hutter Prize

en.m.wikipedia.org/wiki/Hutter

0x10f

@mhoye IIRC, a few years ago someone found a way to categorize music files (MIDI?) by testing which ones compress well together.

Simon Cozens

@mhoye This is awesome, but I'm surprised it wasn't better known. I have vague memories of going to a talk by a researcher in Oxford about 25 years ago about using gzip compression for text analysis. His presentation explained entropy and how compression is prediction, then looked at categorising text by gzipping it. Can't remember the name; some guy doing inference stuff in the psychology department. This is going to bug me now.

mhoye

@simoncozens From what I can tell the fact of it wasn't a big secret, but the idea that with apparently negligible effort you can outperform tools that are insanely expensive and wildly more complicated is the interesting part.

Simon Cozens

@mhoye The right algorithm in the right place beats an inscrutable pile of ReLUs every damned time.

mhoye

@simoncozens I read this in the G-Man's voice from Half Life.

Angle

@mhoye ...That's mildly terrifying if it works out, actually. A major jump in AI capabilities. Eep. :/

mhoye

@Angle I don't think it's a jump in capabilities, just a massive decrease in cost and complexity, which is different (and democratizing!)

Adam Nelson

@mhoye So this works for classification, but can it be adapted to work for transformers with attention? I still can't fully get my head around how transformers work, so I don't know if this technique translates, but if it does... 🤯

tinyrabbit

@mhoye you can outperform a non-pre-trained deep neural network. I don’t understand text classification enough to know what the difference is there. “Non-pre-trained” sounds like a key factor in that comparison though.

DELETED

@mhoye This is way beyond my understanding of text classification, but I can vaguely see what they are saying. How did nobody think of that before?

mhoye

@erik That part was fascinating, wasn't it? And, like, a mic drop in the middle of It's Raining Microphones. "Yeah, we tested it on Hanyu Pinyin, works great", and the whole AI community is sitting there saying "you what".
