aziz

Hey, four years after asking on Stack Overflow, still no answers but mine on how to do approximate nearest neighbors (ANN) search for Unicode strings given a bigger-than-memory dataset...!

The goal is to build a spellchecker (a kind of search engine) for concepts and entities that is fast, small, and cost-efficient. The input space is bigger than common dictionary words.

My current approach relies on simhash, but that requires indexing permutations of the hash, which in turn results in an indexing factor of 4: for every byte indexed, I need around 4 more bytes to make the initial byte easy to discover. (Also, I am not sure, but I recall that it does not perform much better than indexing permutations of the input string.)
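For reference, here is roughly what that scheme looks like; the n-gram size, the 64-bit hash, and the 4-block rotation are illustrative assumptions, not my exact code:

```python
import hashlib

def simhash(text, ngram=3, bits=64):
    """64-bit simhash over character n-grams."""
    counts = [0] * bits
    for i in range(max(1, len(text) - ngram + 1)):
        gram = text[i:i + ngram].encode()
        h = int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def indexed_keys(h, bits=64, blocks=4):
    """Store `blocks` rotated copies of the hash: hashes at small Hamming
    distance then share a prefix in at least one copy, at the cost of a
    4x indexing factor."""
    step = bits // blocks
    mask = (1 << bits) - 1
    return [((h << (i * step)) | (h >> (bits - i * step))) & mask
            for i in range(blocks)]
```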

Today, I felt like #machinelearning and fired up #emacs 🖨️ with #pytorch 🐍 🔦. There is documentation with a tutorial (!) that almost did what I need: pytorch.org/tutorials/beginner

It works locally on my Intel integrated GPU 🙂

Still requires some work (and #help?), but I have a prototype that gives good enough results.

My goal is to create a character embedding / vector space where I can project ASCII, and eventually Unicode, user input; then I can use an ANN algorithm such as HNSW to match the input text string against known text strings.
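A minimal sketch of that pipeline, assuming a mean-of-character-embeddings encoder (the real model from the tutorial may pool differently) and hnswlib for the HNSW side:

```python
import torch
import hnswlib

VOCAB = "abcdefghijklmnopqrstuvwxyz "   # 27 symbols; space is an assumption
DIM = 100

# In the real prototype these weights are trained; here they are random.
embedding = torch.nn.Embedding(len(VOCAB), DIM)

def encode(text: str) -> torch.Tensor:
    """Project a string into the character vector space (mean pooling)."""
    ids = torch.tensor([VOCAB.index(c) for c in text if c in VOCAB])
    return embedding(ids).mean(dim=0)

names = ["armitage", "amari", "marriott"]  # stand-in for the real dataset
vectors = torch.stack([encode(n) for n in names]).detach().numpy()

index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=len(names), ef_construction=200, M=16)
index.add_items(vectors, list(range(len(names))))

labels, _ = index.knn_query(encode("amarite").detach().numpy(), k=3)
print([names[i] for i in labels[0]])
```

The nice property is that `encode` is the only learned part; everything after it is off-the-shelf ANN.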

So far, the matrix that holds the weights needed to compute the query vector is only 1/6th the size of the original data; to that I will need to add the HNSW structure, stored in an OKVS. I estimate that the full index will be twice the size of the original dataset, so there should be some savings compared to simhash. In other words, it is promising.
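I have not settled on the OKVS layout, but the idea is roughly one key per (node, layer) pair mapping to a packed list of neighbor ids; here is a hypothetical sketch with a plain dict standing in for the OKVS:

```python
import struct

okvs = {}  # stand-in for a real ordered key-value store

def put_neighbors(node_id: int, layer: int, neighbor_ids: list[int]) -> None:
    # Key: big-endian (node, layer) so all layers of a node are adjacent
    # when keys are scanned in order.
    key = struct.pack(">QB", node_id, layer)
    okvs[key] = struct.pack(f">{len(neighbor_ids)}Q", *neighbor_ids)

def get_neighbors(node_id: int, layer: int) -> list[int]:
    value = okvs.get(struct.pack(">QB", node_id, layer), b"")
    return list(struct.unpack(f">{len(value) // 8}Q", value))

put_neighbors(42, 0, [7, 13, 99])
assert get_neighbors(42, 0) == [7, 13, 99]
```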

I also need to code a benchmark game.

#everybody must know: I want to do that with a Scheme, maybe #racket.

2 comments
aziz

I am manually testing the algorithm. With a database of English, French, and Arabic first names (~60 KB), searching for the first name `amarite` returns the following top 10 results, best first:

- armitage
- armistead
- braithwaite
- amari
- marriott
- savatier
- lattimore
- antram
- waterman
- fairweather

The results are missing an obvious `amari`. Edit: `amari` is in fact in the results above.

The embedding weights are 11637 bytes on disk; the raw matrix is 4 bytes * 27 (vocabulary size) * 100 (embedding dimension) = 10800 bytes, the rest presumably being serialization overhead.

And the index that holds the pre-computed embeddings of the first names for easy retrieval is 16166831 bytes, ~16 megabytes.

That is an indexing factor of roughly 260. Those 16 megabytes will be stored on disk.
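For the record, the back-of-the-envelope arithmetic (the ~60 KB figure is approximate, so the factor is too):

```python
weights_bytes = 4 * 27 * 100      # float32 * vocabulary * dimension = 10800
index_bytes = 16_166_831          # pre-computed embeddings on disk
dataset_bytes = 60 * 1024         # ~60 KB of first names

print(index_bytes / dataset_bytes)   # ~263, i.e. an indexing factor around 260
```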


aziz

I am proceeding on my quest to make a small index for doing typo correction. I looked into:

- github.com/m31coding/fuzzy-sea, a bag of tricks for doing fuzzy search; it is not yet clear to me how to keep its index on disk

- I looked into Spotify's Annoy, and also arroy, a Rust rewrite by the French team behind Meilisearch that uses LMDB: github.com/meilisearch/arroy/ (see the sketch after this list)

- I discovered DeezyMatch, which according to the README does what I want (fuzzy search) and looks promising; I will need to look at the algorithm they use: github.com/Living-with-machine

- There is also resin, a C# vector-space database, that I could look at: github.com/kreeben/resin
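Of those, Annoy is the clearest about the on-disk question: its index file is memory-mapped rather than loaded into RAM. A minimal sketch with its actual API (the vectors here are random placeholders):

```python
from annoy import AnnoyIndex
import random

DIM = 100
index = AnnoyIndex(DIM, "angular")
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(DIM)])
index.build(10)            # 10 trees
index.save("names.ann")

# Later, in another process: the file is mmap-ed, not loaded into RAM.
index = AnnoyIndex(DIM, "angular")
index.load("names.ann")
print(index.get_nns_by_vector([0.0] * DIM, 10))
```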

Even if my prototype runs on an integrated GPU, it feels awkward, given the current environment, to push more AI/ML stuff.
