Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@XaiaX @mhoye but that's almost already the case, existing models are pretty large. I'd worry more about the computational complexity / representational examples for classification against (O(|to_classify| * |reference_dataset| * maxlen(to_classify {minkowski_plus} reference_dataset))

Like 13 Jul 2023 at 21:30 | Wall-to-wall | Open on chaos.social

5 comments

mhoye

@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror is plenty?

13 Jul 2023 at 21:34 | Open on mastodon.social

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX > spicy autocomplete

huh, nah the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

probably yes. (imo coolest would be producing a tool that could both classify and predict using the same infrastructure (light preprocessed large text dumps (<10GiB), massive improvement)

13 Jul 2023 at 21:38 | Open on chaos.social

mhoye

@fogti @XaiaX I'm a bit more interested in predictive text tools that give you stylistic nudges towards artists you admire, and finding a way to get artists paid for that. Smart autocomplete/smart fill tools that answer, "what might Degas have done right here?"

13 Jul 2023 at 21:53 | Open on mastodon.social

Τοπάζ Αλαιν Φογτια Αννα Εμιλια

@mhoye @XaiaX interesting thing is that it is probably *much* easier to simultaneously get the information for text completion *and* also what authors were involved in that match (instead of a large pool of authors just the relevant subset). [in the case of these compression-decompression + reference dataset models]

13 Jul 2023 at 21:55 | Open on chaos.social

Matt Stratford

@mhoye @fogti @XaiaX

Idle thought that LaTeX equations rendering properly would be a pretty good use case for a Mastodon client, given the overall geeky clientele on here! cc. @ivory

13 Jul 2023 at 23:04 | Open on mastodon.social