@fogti @XaiaX So, I don't actually think that acres of storage space are all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror is plenty?
@mhoye @XaiaX > spicy autocomplete
huh, nah, the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.) but not the same. (rough sketch of the classification side below)
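A minimal sketch of what I mean, not a real implementation: build a per-class character frequency table from sample texts, approximate Huffman code lengths with Shannon lengths (-log2 p), and label new text by whichever class's table would encode it in the fewest bits. The class names and sample strings here are made-up placeholders.

```python
import math
from collections import Counter

def code_lengths(texts):
    """Approximate per-character code lengths (bits) for one class,
    standing in for a real Huffman table."""
    counts = Counter("".join(texts))
    total = sum(counts.values())
    # Laplace smoothing so unseen characters get a finite (long) code.
    return {c: -math.log2((n + 1) / (total + 256)) for c, n in counts.items()}

def encoded_bits(text, table, default=16.0):
    """Total bits to encode `text` with this class's code table."""
    return sum(table.get(c, default) for c in text)

def classify(text, tables):
    # Shorter encoding == better statistical fit == more likely class.
    return min(tables, key=lambda label: encoded_bits(text, tables[label]))

# Hypothetical classes trained on placeholder snippets:
tables = {
    "wiki": code_lengths(["The encyclopedia article describes the topic."]),
    "fiction": code_lengths(["It was a dark and stormy night, she said."]),
}
print(classify("The article covers the history of the region.", tables))
```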
> So, maybe a wikipedia dump and a gutenberg mirror is plenty?
probably yes. (imo the coolest thing would be producing a tool that could both classify and predict using the same infrastructure, i.e. lightly preprocessed large text dumps (<10GiB); that would be a massive improvement. something like the sketch below.)
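Hedged sketch of the "same infrastructure" idea: one bigram count table built from a text dump can both score text (classification, as in the earlier sketch) and rank next-character suggestions (prediction). The tiny corpus here is a stand-in for a preprocessed Wikipedia/Gutenberg dump.

```python
from collections import Counter, defaultdict

def bigram_table(corpus):
    """Count character bigrams once; reuse for scoring and suggesting."""
    table = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        table[a][b] += 1
    return table

def suggest(table, prefix, k=3):
    """Top-k next-character suggestions given the last character typed."""
    if not prefix or prefix[-1] not in table:
        return []
    return [ch for ch, _ in table[prefix[-1]].most_common(k)]

table = bigram_table("the quick brown fox jumps over the lazy dog " * 3)
print(suggest(table, "th"))  # likely ['e', ...]
```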