aziz

Hey, four years after asking on Stack Overflow, still no answer but mine on how to do approximate nearest neighbor (ANN) search for Unicode strings given a bigger-than-memory dataset...!

The goal is to build a spellchecker (some kind of search engine) for concepts and entities that is fast, small, and cost-efficient. The input space is bigger than everyday dictionary words.

My current approach relies on simhash, but that requires indexing permutations of the hash, which in turn results in an index factor of 4: for every byte indexed, I need around 4 more bytes to make the initial byte easy to discover. (Also, I am not sure, but I think I recall that it does not perform much better than indexing permutations of the input string.)
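Roughly, the simhash + rotation trick I mean, sketched in Python; the 3-gram hashing, 64-bit fingerprint, and 4 rotations are illustrative assumptions, not the actual code:

```python
import hashlib

def simhash(text, ngram=3, bits=64):
    """64-bit simhash over character n-grams of `text`."""
    counts = [0] * bits
    for i in range(max(1, len(text) - ngram + 1)):
        gram = text[i:i + ngram]
        h = int.from_bytes(hashlib.blake2b(gram.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def rotations(fingerprint, bits=64, blocks=4):
    """Store `blocks` rotations of the fingerprint, so that fingerprints at a
    small Hamming distance share a prefix in at least one rotation; storing
    the extra rotations is where the ~4x index blow-up comes from."""
    step = bits // blocks
    mask = (1 << bits) - 1
    return [((fingerprint << (k * step)) | (fingerprint >> (bits - k * step))) & mask
            for k in range(blocks)]
```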

Today, I felt #machinelearning and started #emacs 🖨️ with #pytorch 🐍🔦. There is documentation with a tutorial (!) that almost did what I need: pytorch.org/tutorials/beginner

It works locally on my Intel integrated GPU 🙂

Still requires some work (and #help?), but I have a prototype that gives good enough results.

My goal is to create a character embedding / vector space where I can project ASCII, and eventually Unicode, user input; then I can use an ANN algorithm such as HNSW to match the input text string against known text strings.
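A minimal sketch of that projection step, assuming a plain `nn.Embedding` over a 27-symbol ASCII vocabulary with 100 dimensions (the figures mentioned below) and simple mean pooling; the actual model may differ:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 27   # 'a'-'z' plus one padding/unknown slot (assumption)
EMBED_DIM = 100

class CharEncoder(nn.Module):
    """Projects a string into a fixed-size vector by averaging
    learned character embeddings."""
    def __init__(self, vocab_size=VOCAB_SIZE, dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, char_ids):              # char_ids: (length,) LongTensor
        return self.embed(char_ids).mean(dim=0)   # (dim,) string vector

def encode(text):
    """Map lowercase a-z to ids 1..26, anything else to 0."""
    ids = [ord(c) - ord('a') + 1 if 'a' <= c <= 'z' else 0 for c in text.lower()]
    return torch.tensor(ids, dtype=torch.long)

model = CharEncoder()
query_vec = model(encode("amarite"))          # vector to feed into the ANN index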

So far, the matrix that holds the weights used to compute the query vector is only 1/6th of the original data; to that I will need to add the HNSW structure stored in an OKVS. I estimate that the whole index will be about twice the size of the original dataset, so there should be some savings. In other words, it is promising.
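How the HNSW graph ends up laid out in the OKVS is still open; one plausible key scheme, sketched with a plain Python dict standing in for the ordered key-value store, could be:

```python
import struct

# Stand-in for an OKVS: in reality this is an ordered key-value store with
# lexicographically sorted byte keys; a dict only keeps the sketch small.
okvs = {}

def neighbors_key(node_id, layer):
    # Big-endian packing keeps all layers of a node contiguous in an
    # ordered store, so one range scan fetches the whole adjacency.
    return b"hnsw/" + struct.pack(">QB", node_id, layer)

def put_neighbors(node_id, layer, neighbor_ids):
    okvs[neighbors_key(node_id, layer)] = struct.pack(f">{len(neighbor_ids)}Q", *neighbor_ids)

def get_neighbors(node_id, layer):
    raw = okvs.get(neighbors_key(node_id, layer), b"")
    return [x for (x,) in struct.iter_unpack(">Q", raw)]
```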

I also need to code a benchmark game.

#everybody must know: I want to do that with a Scheme, maybe #racket.

aziz

I am manually testing the algorithm. With a database of English, French, and Arabic first names (~60 KB), searching for the first name `amarite` returns the following top 10 results, best first:

- armitage
- armistead
- braithwaite
- amari
- marriott
- savatier
- lattimore
- antram
- waterman
- fairweather

The results are missing an obvious `amari`. Edit: `amari` is in fact among the results.
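For reference, a brute-force version of that query (not the HNSW path), assuming the per-name vectors come from an encoder like the sketch above and ranking by cosine similarity:

```python
import torch
import torch.nn.functional as F

def top_k(query_vec, name_vectors, names, k=10):
    """Rank known first names by cosine similarity to the query vector.
    `name_vectors` is an (N, dim) tensor of pre-computed embeddings,
    `names` the matching list of strings (both hypothetical here)."""
    sims = F.cosine_similarity(name_vectors, query_vec.unsqueeze(0), dim=1)
    best = torch.topk(sims, k).indices
    return [names[i] for i in best.tolist()]

# e.g. top_k(model(encode("amarite")), name_vectors, names)
```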

The embedding weights are 11,637 bytes on disk, roughly 4 bytes × 27 (vocabulary size) × 100 (embedding dimension) = 10,800 bytes, plus serialization overhead.

And the index that holds the pre-computed embeddings of the first names, for easy retrieval, is 16,166,831 bytes, i.e. ~16 megabytes.

That is an indexing factor of around 260 (~16 MB of index for ~60 KB of data). That last, big 16 megabytes will be stored on disk.
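The arithmetic behind those numbers, taking the ~60 KB dataset size from above as an approximation:

```python
weights_bytes = 4 * 27 * 100        # float32 * vocabulary * embedding dim = 10,800
                                    # (the 11,637 bytes reported above include overhead)
index_bytes = 16_166_831            # pre-computed first-name embeddings
dataset_bytes = 60_000              # ~60 KB of first names (approximate)

print(index_bytes / dataset_bytes)  # ~269, i.e. the indexing factor of ~260
```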

aziz

📣 I am working on a content-addressable #scheme #programming language with lineage and #i18n.

If you are up for some more:

Parentheses, peace, convos, patches, and lovers are welcome!
