Top-level
5 comments
@mhoye @XaiaX
> spicy autocomplete huh

Nah, the topic here is text classification, which is similar to text prediction (and there is probably a way to use Huffman tables to produce suggestions, etc.), but not the same.

> So, maybe a wikipedia dump and a gutenberg mirror is plenty?

Probably yes. (IMO the coolest result would be a tool that could both classify and predict using the same infrastructure: lightly preprocessed large text dumps (<10 GiB), which would be a massive improvement.)

@mhoye @XaiaX The interesting thing is that it is probably *much* easier to simultaneously get the information needed for text completion *and* also which authors were involved in that match (so instead of a large pool of authors, just the relevant subset), in the case of these compression-decompression + reference dataset models. A rough sketch of what I mean is below.

Idle thought that LaTeX equations rendering properly would be a pretty good use case for a Mastodon client, given the overall geeky clientele on here! cc. @ivory
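Nothing above pins down the exact scheme, so here is just a minimal sketch of one well-known compression-plus-reference-dataset approach (gzip-based normalized compression distance with nearest-reference classification); the labels and the tiny public-domain snippets stand in for a real reference dataset:

```python
import gzip

def compressed_size(text: str) -> int:
    # Length of the gzip-compressed byte string.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance: small when the compressor's model of b
    # (its dictionary and Huffman tables) already describes a well.
    ca, cb, cab = compressed_size(a), compressed_size(b), compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text: str, references: dict) -> str:
    # Label the input with whichever reference document it compresses best against.
    return min(references, key=lambda label: ncd(text, references[label]))

# Hypothetical reference dataset: label -> sample text per author.
references = {
    "austen": "It is a truth universally acknowledged, that a single man in "
              "possession of a good fortune, must be in want of a wife.",
    "melville": "Call me Ishmael. Some years ago, never mind how long precisely, "
                "having little or no money in my purse, I thought I would sail about.",
}

print(classify("Whenever I find myself growing grim about the mouth, I account "
               "it high time to get to sea as soon as I can.", references))
```

The same distance computation that picks the label also reports which reference documents were closest, which is the "relevant subset of authors" point: the classifier's by-product is exactly the shortlist a completion step would want to work from.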
@fogti @XaiaX So, I don't actually think that acres of storage space is all that meaningful if you just want to focus on good-enough utility. We all kind of know you're not getting more than varying heats of spicy autocomplete out of these tools, and the core insight of statistics as a field is that you can make a very accurate approximation of the state of a large dataset out of a small subset of that data. So, maybe a wikipedia dump and a gutenberg mirror is plenty?
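For what it's worth, the "accurate approximation from a small subset" point is just ordinary random sampling; a toy illustration with synthetic documents (the numbers are invented, nothing Wikipedia- or Gutenberg-specific):

```python
import random

random.seed(0)

# Synthetic stand-in for a large corpus: 100,000 "documents" of varying length.
corpus = ["word " * random.randint(5, 60) for _ in range(100_000)]

# Estimate the mean document length from a 1% random sample...
sample = random.sample(corpus, 1_000)
estimate = sum(len(doc.split()) for doc in sample) / len(sample)

# ...and compare it with the exhaustive pass over everything.
truth = sum(len(doc.split()) for doc in corpus) / len(corpus)
print(f"sample estimate: {estimate:.1f} words, full-corpus mean: {truth:.1f} words")
```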