mhoye

Holy _shit_ this paper, and the insight behind it.

You know how every receiver is also a transmitter, _well_: every text predictor is also a text compressor, and vice-versa.

You can outperform massive neural networks running millions of parameters, with a few lines of python and a novel application of _gzip_.

aclanthology.org/2023.findings
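
(For anyone who wants to see the shape of the idea: the sketch below is a rough reconstruction, not the authors' code. It compresses texts with gzip, uses only the resulting byte lengths to compute a normalized compression distance, and classifies by nearest neighbours against a labelled reference set. The toy dataset, the space-separated concatenation, and the k value are all made up for illustration.)

```python
import gzip

def clen(s: str) -> int:
    # length in bytes of the gzip-compressed text
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # normalized compression distance, computed from byte lengths only
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, labelled: list[tuple[str, str]], k: int = 3) -> str:
    # k-nearest-neighbour vote over NCD to every labelled reference example
    neighbours = sorted(labelled, key=lambda pair: ncd(text, pair[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# toy usage with made-up examples
training = [
    ("the striker scored a late winner in the cup final", "sport"),
    ("the midfielder was transferred for a record fee", "sport"),
    ("the central bank raised interest rates again", "finance"),
    ("markets fell after the quarterly earnings report", "finance"),
]
print(classify("shares dropped when the rate decision was announced", training))
```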

bmaxv

@mhoye I will have to look at this.

I always felt I was just a smidge not smart enough to really dive in and get machine learning, in a "let me implement this myself" way.

But I get nearest neighbors and I should be able to understand zipping.

ranjit

@darius @aparrish i don't really understand the implications of this paper, but i'm hoping it means that gzip is sentient

mhoye

If this is reliable, this is "take something that needed a datacenter last year and do it on a phone this year" material.

XaiaX

@mhoye I don’t know, seems like you would need the storage space to store all the references you’re comparing to, even if the computation is easy. Sounds like a time/space trade off.

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@XaiaX @mhoye but that's almost already the case, existing models are pretty large. I'd worry more about the computational complexity / representational examples for classification against: O(|to_classify| * |reference_dataset| * maxlen(to_classify ⊕ reference_dataset))

mhoye

@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror are plenty?

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX > spicy autocomplete

huh, nah the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

probably yes. (imo the coolest thing would be producing a tool that could both classify and predict using the same infrastructure (lightly preprocessed large text dumps, <10GiB); that would be a massive improvement)

mhoye

@fogti @XaiaX I'm a bit more interested in predictive text tools that give you stylistic nudges towards artists you admire, and finding a way to get artists paid for that. Smart autocomplete/smart fill tools that answer, "what might Degas have done right here?"

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX interesting thing is that it is probably *much* easier to simultaneously get the information for text completion *and* also what authors were involved in that match (instead of a large pool of authors just the relevant subset). [in the case of these compression-decompression + reference dataset models]

Matt Stratford

@mhoye @fogti @XaiaX

Idle thought that LaTeX equations rendering properly would be a pretty good use case for a Mastodon client, given the overall geeky clientele on here! cc. @ivory

Dr. Robert M Flight

@mhoye Wait, what? So their claim is that anything that is decent at compression should also be decent at prediction? So therefore we've missed the boat, because all of the work on compression over the past few decades means we have really good predictors?

Definitely going to be reading this.

Gabriele Svelto

@rmflight @mhoye symbol prediction is part of many of the best lossless compression algorithms, isn't it?

mhoye

@gabrielesvelto @rmflight It is, and if this paper's results hold up, we're talking about how large-scale deep-learning networks are fundamentally a technical dead end, and how something that takes a datacenter to do with a DNN can be done better with a clever application of gzip on a phone.
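
(A toy illustration of that prediction-compression link, not taken from the paper: if a predictor assigns probability p to the next character, an ideal entropy coder spends about -log2(p) bits on it, so a better predictor means fewer total bits. The simple character model and repeated sample text below are made up for the example; gzip's size is printed alongside purely for comparison.)

```python
import gzip
import math
from collections import Counter, defaultdict

def ideal_bits(text: str, order: int = 1) -> float:
    # bits an ideal entropy coder would need when driven by a simple
    # adaptive order-N character model with add-one smoothing
    counts = defaultdict(Counter)
    alphabet = set(text)
    total = 0.0
    for i, ch in enumerate(text):
        seen = counts[text[max(0, i - order):i]]
        p = (seen[ch] + 1) / (sum(seen.values()) + len(alphabet))
        total += -math.log2(p)   # cost of coding ch at probability p
        seen[ch] += 1            # the model learns as it goes
    return total

text = "the cat sat on the mat. " * 40
print(ideal_bits(text, order=0))  # weak predictor: lots of bits
print(ideal_bits(text, order=2))  # better predictor: far fewer bits
print(len(gzip.compress(text.encode("utf-8"))) * 8)  # gzip's size in bits, for comparison
```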

Nick Wood

@mhoye @gabrielesvelto @rmflight So what you’re saying is that I’m about to get a screaming deal on a graphics card?

Choong Ng

@mhoye @gabrielesvelto @rmflight On first look I think what this paper suggests is 1) for some classification tasks there's a nicely simple approach that works well, and 2) this is a promising path towards better feature engineering for language models, which will in turn result in better accuracy vs cost.

Choong Ng

@mhoye @gabrielesvelto @rmflight If this works out well we'll see better + smaller models for all tasks (not just classification) that outperform both current DNNs and the NCD technique they use, at moderate cost. There's precedent for this being a successful approach, for example using frequency-domain data for audio models instead of raw PCM. There's also precedent for finding ways DNNs waste a lot of capacity on effectively routing data around, and restructuring to fix that (ResNets, for example).

Choong Ng

@mhoye @gabrielesvelto @rmflight Overall, though, in recent history data-based approaches have tended to win, so I would expect the useful bits to get incorporated into DNNs rather than DNNs being obsoleted in almost any context. My favorite essay on that topic, by Rich Sutton: incompleteideas.net/IncIdeas/B

Paul M. Heider

@rmflight @mhoye Sounds about right. My old coworker Tom always joked that NLP was just a compression algorithm on the input text.

mhoye

@paulmheider @rmflight And Ted Chiang has referred to ML-generated text as "a jpeg of a language", yeah. But to see that come together in fifteen lines of python that out-do these massive, crazy expensive DNN models is bonkers, jaws on the floor material.

Severák

@mhoye what is it for? for detecting the topic of a text without needing the costly training of a neural network?

Daneel Adrian Cayce

@mhoye holy fuck this is potentially brilliant, and I am so excited to dive into reading this after work

mhoye

@sysop "Code is available at $URL" right there in the abstract! Holy shit!

AK

@mhoye You had me at "novel application of gzip"

matzipan

@mhoye @jrconlin the abstract reads like a self-aware joke. Please tell me it's a self-aware joke

bob

@mhoye this is text classification not text prediction

Emily S

@mhoye wait it works on a dataset in Pinyin?! Damn

And it's 14 lines of python

And it's not even looking at the contents of the compression, it's just using the byte lengths? Holy fucking shit that's smart as hell.

😍

Tatjana Scheffler

@mhoye compression methods have been used in text classification for authorship analysis for quite a while.

XaiaX

@tschfflr @mhoye yes, I remember discussion of this at least a decade ago and probably longer.

Still interesting.

XaiaX

@tschfflr @mhoye after looking through it, it seems like they acknowledge all that as the basis for this work.

Albert Cardona

@mhoye Compression is ... very interesting. Recalling here Schmidhuber's take on compression and its predictability being at the root of learning, beauty, novelty, interestingness, boredom ... arxiv.org/pdf/0812.4360

Previously: mathstodon.xyz/@albertcardona/

allison

@mhoye reading this paper and cackling the whole time, it's just so clever

Ben Zanin

@mhoye (this is not particularly novel; I remember reading a similar paper about using gzip to de-anonymize written works back around when I was learning about singular value decomposition for vector-space document searching from Maciej Ceglowski, I think back in the early 2000s. But it definitely is novel that this old technique is now outperforming current tech darlings.)

Ben Zanin

@mhoye that prediction ≈ compression is also the basis of the Hutter Prize

en.m.wikipedia.org/wiki/Hutter

0x10f

@mhoye IIRC, a few years ago someone found a way to categorize music files (MIDI?) by testing which ones compress well together.

Simon Cozens

@mhoye This is awesome, but I'm surprised it wasn't better known. I have vague memories of going to a talk by a researcher in Oxford about 25 years ago about using gzip compression for text analysis. His presentation explained entropy and how compression is prediction, then looked at categorising text by gzipping it. Can't remember the name; some guy doing inference stuff in the psychology department. This is going to bug me now.

mhoye

@simoncozens From what I can tell the fact of it wasn't a big secret, but the idea that with apparently negligible effort you can outperform tools that are insanely expensive and wildly more complicated is the interesting part.

Simon Cozens

@mhoye The right algorithm in the right place beats an inscrutable pile of ReLUs every damned time.

mhoye

@simoncozens I read this in the G-Man's voice from Half Life.

Angle

@mhoye ...That's mildly terrifying if it works out, actually. A major jump in AI capabilities. Eep. :/

mhoye

@Angle I don't think it's a jump in capabilities, just a massive decrease in cost and complexity, which is different (and democratizing!)

Adam Nelson

@mhoye So this works for classification, but can it be adapted to work for transformers with attention? I still can't fully get my head around how transformers work, so I don't know if this technique translates, but if it does... 🤯

tinyrabbit

@mhoye you can outperform a non-pre-trained deep neural network. I don’t understand text classification enough to know what the difference is there. “Non-pre-trained” sounds like a key factor in that comparison though.

DELETED

@mhoye This is way beyond my understanding of text classification, but I can vaguely see what they are saying. How did nobody think of that before?

mhoye

@erik That part was fascinating, wasn't it? And, like, a mic drop in the middle of It's Raining Microphones. "Yeah, we tested it on Hanyu Pinyin, works great", and the whole AI community is sitting there saying "you what".
