aziz

I am manually testing the algorithm. With a database of English, French, and Arabic first names (~60 kB), looking up the first name `amarite` returns the following top 10 results, best first:

- armitage
- armistead
- braithwaite
- amari
- marriott
- savatier
- lattimore
- antram
- waterman
- fairweather

At first I thought the results were missing the obvious match amari. Edit: amari is in the results, in fourth place.
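The post does not say how a name is turned into a vector, so the following is only a minimal sketch of the retrieval step: embed the query, then rank every stored name by cosine similarity and keep the best 10. The averaging scheme in `embed`, the random placeholder `WEIGHTS`, and the names `build_index` and `top_k` are all assumptions made for illustration, not the actual model.

```python
import heapq
import math
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz?"  # 26 letters + 1 catch-all symbol = 27
DIM = 100                               # embedding dimension from the post

# Placeholder weights: random stand-ins for the trained embedding table.
_rng = random.Random(0)
WEIGHTS = {ch: [_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for ch in VOCAB}


def embed(name):
    """Assumed scheme: average the vectors of the characters in the name."""
    chars = [ch if ch in WEIGHTS else "?" for ch in name.lower()]
    vec = [0.0] * DIM
    for ch in chars:
        for i, w in enumerate(WEIGHTS[ch]):
            vec[i] += w
    n = max(len(chars), 1)
    return [v / n for v in vec]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(names):
    """Pre-compute one embedding per name."""
    return [(name, embed(name)) for name in names]


def top_k(query, index, k=10):
    """Brute-force search: return the k (score, name) pairs most similar to query."""
    q = embed(query)
    return heapq.nlargest(k, ((cosine(q, vec), name) for name, vec in index))
```

Calling `top_k("amarite", build_index(names))` returns the candidates best first. A brute-force scan like this is linear in the number of names; libraries such as annoy exist to avoid that scan on larger collections.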

The embedding weights file is 11637 bytes, close to the raw size of 4 bytes * 27 (vocabulary size) * 100 (embedding dimension) = 10800 bytes.

The index that holds the pre-computed embeddings of the first names for easy retrieval is 16166831 bytes, about 16 megabytes.

That is an indexing factor of about 260 relative to the ~60 kB database. The large 16-megabyte index will be stored on disk.
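The size arithmetic can be checked in a few lines. This is just a worked calculation from the numbers quoted above. Reading "~60kb" as 60 KiB (61440 bytes) is an assumption, and so is the guess that the weights are stored as 4-byte floats with nothing else in the raw count.

```python
FLOAT_BYTES = 4        # assumed float32 weights
VOCAB_SIZE = 27        # vocabulary size from the post
EMBEDDING_DIM = 100    # embedding dimension from the post

WEIGHTS_FILE_BYTES = 11637      # reported size of the weights file
INDEX_FILE_BYTES = 16166831     # reported size of the index
NAMES_DB_BYTES = 60 * 1024      # "~60kb" read as 60 KiB (assumption)

# Raw parameter payload, without any file-format framing.
raw_weights = FLOAT_BYTES * VOCAB_SIZE * EMBEDDING_DIM

# Whatever the file holds beyond the raw floats (headers, metadata, ...).
extra_bytes = WEIGHTS_FILE_BYTES - raw_weights

# How many times larger the index is than the plain list of names.
indexing_factor = INDEX_FILE_BYTES / NAMES_DB_BYTES

print(raw_weights, extra_bytes, round(indexing_factor))
```

With these inputs the raw payload is 10800 bytes, the weights file carries 837 bytes beyond that, and the factor comes out near 263, in line with the rounded 260.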

1 comment
aziz

I am continuing my quest to build a small index for typo correction. I looked into:

- github.com/m31coding/fuzzy-sea, which is a bag of tricks for fuzzy search; it is not yet clear to me how to keep its index on disk

- Spotify's annoy, and also arroy, which is written in Rust by French people and uses LMDB: github.com/meilisearch/arroy/

- DeezyMatch, which according to its README does what I want (fuzzy search) and looks promising. I still need to look at the algorithm it uses. github.com/Living-with-machine

- There is also resin, a C# vector space database that I could look at: github.com/kreeben/resin
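On the question of keeping the index on disk, one simple option that needs no library is a flat file of fixed-width vectors read through a memory map, so only the rows actually touched get paged in. This is a minimal sketch under assumptions: the layout (little-endian float32, 100 values per row), the file name, and the helper names `write_index` and `read_vector` are hypothetical, and none of the projects listed above is claimed to work this way.

```python
import mmap
import os
import struct
import tempfile

DIM = 100
ROW = struct.Struct(f"<{DIM}f")  # one little-endian float32 vector, 400 bytes


def write_index(path, vectors):
    """Append each vector as one fixed-width row, so row i starts at i * ROW.size."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(ROW.pack(*vec))


def read_vector(path, row):
    """Read a single row through a read-only memory map, without loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return ROW.unpack_from(mm, row * ROW.size)


if __name__ == "__main__":
    # Tiny demonstration with three made-up vectors in a temporary directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "index.bin")
    write_index(demo_path, [[float(i)] * DIM for i in range(3)])
    print(os.path.getsize(demo_path), read_vector(demo_path, 1)[:3])
```

A real search would keep the map open across queries rather than reopening the file for each row. The fixed row width is what makes random access a single offset computation.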

Even though my prototype runs on an integrated GPU, it feels awkward, given the current environment, to push more AI/ML stuff.
