I am manually testing the algorithm. With a database of English, French, and Arabic first names (~60 kB), searching for the first name `amarite` returns the following top 10 results, best first:
- armitage
- armistead
- braithwaite
- amari
- marriott
- savatier
- lattimore
- antram
- waterman
- fairweather
At first I thought the results were missing the obvious `amari`. Edit: `amari` is in the results, in fourth place.
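The embedding weights file is 11637 bytes. The raw weights account for 4 bytes * 27 (vocabulary size) * 100 (embedding dimension) = 10800 bytes; the remaining 837 bytes are presumably serialization overhead.
The index that holds the pre-computed embeddings of the first names, for easy retrieval, is 16166831 bytes, about 16 megabytes.
That is an indexing factor of roughly 260 to 270 relative to the ~60 kB database, depending on its exact size. The large 16-megabyte index will be stored on disk.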
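To double-check the sizes quoted above, here is a small arithmetic sketch. It assumes 4-byte (float32) weights and takes "~60 kB" as exactly 60,000 bytes; both are assumptions, and the constant names are only for illustration.

```python
# Sizes quoted in the text.
VOCAB_SIZE = 27              # vocabulary size
EMBEDDING_DIM = 100          # embedding dimension
BYTES_PER_WEIGHT = 4         # assumption: float32 weights

WEIGHTS_FILE_BYTES = 11_637  # size of the embedding weights file
INDEX_BYTES = 16_166_831     # size of the pre-computed embedding index
DATABASE_BYTES = 60_000      # assumption: "~60 kB" taken as 60,000 bytes

# Raw payload of the weight matrix, without any file format overhead.
raw_weight_bytes = BYTES_PER_WEIGHT * VOCAB_SIZE * EMBEDDING_DIM

# Whatever is left in the file is not raw weights.
overhead_bytes = WEIGHTS_FILE_BYTES - raw_weight_bytes

# How many times bigger the index is than the source database.
indexing_factor = INDEX_BYTES / DATABASE_BYTES

print(raw_weight_bytes)        # 10800
print(overhead_bytes)          # 837
print(round(indexing_factor))  # 269
```

With 60 KiB (61,440 bytes) instead of 60,000 bytes, the factor comes out closer to 263, so the exact figure depends on the real size of the database.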
I am continuing my quest for a small index for typo correction. I looked into:
- https://github.com/m31coding/fuzzy-search, a bag of tricks for fuzzy search; it is not yet clear to me how to keep its index on disk
- Spotify's annoy, and also arroy, which is written in Rust by French people and uses LMDB: https://github.com/meilisearch/arroy/
- DeezyMatch, which according to its README does what I want (fuzzy search) and looks promising. I will need to look at the algorithm it uses. https://github.com/Living-with-machines/DeezyMatch
- resin, a C# vector space database that I could also look at: https://github.com/kreeben/resin
Even though my prototype runs on an integrated GPU, it feels awkward, given the current environment, to push more AI/ML stuff.
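On the open question of keeping an index on disk, one generic option is a flat file of fixed-width float32 rows that is memory-mapped and scanned at query time. This is not how any of the libraries above work; it is only a minimal standard-library sketch. The functions `write_index` and `top_k` are hypothetical, and the vectors are made-up toy data, not real embeddings.

```python
import heapq
import mmap
import os
import struct
import tempfile

DIM = 4  # toy dimension for the example; the text uses 100


def write_index(path, vectors):
    """Write vectors as consecutive little-endian float32 rows."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(f"<{DIM}f", *vec))


def top_k(path, query, k):
    """Scan the memory-mapped file and return the k best (score, row) pairs."""
    row_bytes = 4 * DIM
    scored = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for row in range(len(mm) // row_bytes):
                # Decode one row straight from the mapped file.
                vec = struct.unpack_from(f"<{DIM}f", mm, row * row_bytes)
                score = sum(q * v for q, v in zip(query, vec))  # dot product
                scored.append((score, row))
    return heapq.nlargest(k, scored)


# Toy data: names are kept in a separate list, row i belongs to names[i].
names = ["amari", "armitage", "waterman"]
vectors = [
    (1.0, 0.0, 0.0, 0.0),
    (0.5, 0.5, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
]

path = os.path.join(tempfile.mkdtemp(), "index.f32")
write_index(path, vectors)

best = top_k(path, (1.0, 0.0, 0.0, 0.0), k=2)
print([names[row] for _, row in best])  # ['amari', 'armitage']
```

The scan is linear in the number of rows, so it only trades memory for disk reads; it does not make the index smaller or the search sublinear.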