aziz

Hey, four years after asking on Stack Overflow, still no answer but mine on how to do approximate nearest neighbor (ANN) search for Unicode strings given a bigger-than-memory dataset...!

The goal is to build a spellchecker (some kind of search engine) for concepts and entities that is fast, small, and cost-efficient. The input space is bigger than everyday dictionary words.

My current approach relies on simhash, but that requires indexing permutations of the hash, which in turn results in an index factor of 4: for every byte indexed, I need around 4 more bytes to make the initial byte easy to discover. (Also, I am not sure, but I think I recall that it does not perform much better than indexing permutations of the input string.)
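Roughly, the simhash + rotation trick I mean, sketched in Python; the 3-gram hashing, 64-bit fingerprint, and 4 rotations are illustrative assumptions, not the actual code:

```python
import hashlib

def simhash(text, ngram=3, bits=64):
    """64-bit simhash over character n-grams of `text`."""
    counts = [0] * bits
    for i in range(max(1, len(text) - ngram + 1)):
        gram = text[i:i + ngram]
        h = int.from_bytes(hashlib.blake2b(gram.encode(), digest_size=8).digest(), "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def rotations(fingerprint, bits=64, blocks=4):
    """Store `blocks` rotations of the fingerprint, so that fingerprints at a
    small Hamming distance share a prefix in at least one rotation; storing
    the extra rotations is where the ~4x index blow-up comes from."""
    step = bits // blocks
    mask = (1 << bits) - 1
    return [((fingerprint << (k * step)) | (fingerprint >> (bits - k * step))) & mask
            for k in range(blocks)]
```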

Today, I felt #machinelearning and started #emacs 🖨️ with #pytorch 🐍🔦. There is documentation with a tutorial (!) that almost did what I need: pytorch.org/tutorials/beginner

It works locally on my Intel integrated GPU 🙂

Still requires some work (and #help?), but I have a prototype that gives good enough results.

My goal is to create a character embedding / vector space where I can project ASCII, and eventually Unicode, user input; then I can use an ANN algorithm such as HNSW to match the input text string against known text strings.
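A minimal sketch of that projection step, assuming a plain `nn.Embedding` over a 27-symbol ASCII vocabulary with 100 dimensions (the figures mentioned below) and simple mean pooling; the actual model may differ:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 27   # 'a'-'z' plus one padding/unknown slot (assumption)
EMBED_DIM = 100

class CharEncoder(nn.Module):
    """Projects a string into a fixed-size vector by averaging
    learned character embeddings."""
    def __init__(self, vocab_size=VOCAB_SIZE, dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, char_ids):              # char_ids: (length,) LongTensor
        return self.embed(char_ids).mean(dim=0)   # (dim,) string vector

def encode(text):
    """Map lowercase a-z to ids 1..26, anything else to 0."""
    ids = [ord(c) - ord('a') + 1 if 'a' <= c <= 'z' else 0 for c in text.lower()]
    return torch.tensor(ids, dtype=torch.long)

model = CharEncoder()
query_vec = model(encode("amarite"))          # vector to feed into the ANN index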

So far, the matrix that holds the weights used to compute the query vector is only 1/6th of the original data; to that I will need to add the HNSW structure stored in an OKVS. I estimate that the whole index will be about twice the size of the original dataset, so there should be some savings. In other words, it is promising.
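How the HNSW graph ends up laid out in the OKVS is still open; one plausible key scheme, sketched with a plain Python dict standing in for the ordered key-value store, could be:

```python
import struct

# Stand-in for an OKVS: in reality this is an ordered key-value store with
# lexicographically sorted byte keys; a dict only keeps the sketch small.
okvs = {}

def neighbors_key(node_id, layer):
    # Big-endian packing keeps all layers of a node contiguous in an
    # ordered store, so one range scan fetches the whole adjacency.
    return b"hnsw/" + struct.pack(">QB", node_id, layer)

def put_neighbors(node_id, layer, neighbor_ids):
    okvs[neighbors_key(node_id, layer)] = struct.pack(f">{len(neighbor_ids)}Q", *neighbor_ids)

def get_neighbors(node_id, layer):
    raw = okvs.get(neighbors_key(node_id, layer), b"")
    return [x for (x,) in struct.iter_unpack(">Q", raw)]
```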

I also need to code a benchmark game.

#everybody must know: I want to do that with a Scheme, maybe #racket.

aziz

I am manually testing the algorithm. With a database of English, French, and Arabic first names (~60 KB), searching for the first name `amarite` returns the following top 10 results, best first:

- armitage
- armistead
- braithwaite
- amari
- marriott
- savatier
- lattimore
- antram
- waterman
- fairweather

The results are missing an obvious `amari`. Edit: `amari` is in fact among the results.
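For reference, a brute-force version of that query (not the HNSW path), assuming the per-name vectors come from an encoder like the sketch above and ranking by cosine similarity:

```python
import torch
import torch.nn.functional as F

def top_k(query_vec, name_vectors, names, k=10):
    """Rank known first names by cosine similarity to the query vector.
    `name_vectors` is an (N, dim) tensor of pre-computed embeddings,
    `names` the matching list of strings (both hypothetical here)."""
    sims = F.cosine_similarity(name_vectors, query_vec.unsqueeze(0), dim=1)
    best = torch.topk(sims, k).indices
    return [names[i] for i in best.tolist()]

# e.g. top_k(model(encode("amarite")), name_vectors, names)
```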

The embedding weights are 11,637 bytes on disk, roughly 4 bytes × 27 (vocabulary size) × 100 (embedding dimension) = 10,800 bytes, plus serialization overhead.

And the index that holds the pre-computed embeddings of the first names, for easy retrieval, is 16,166,831 bytes, i.e. ~16 megabytes.

That is an indexing factor of around 260 (~16 MB of index for ~60 KB of data). That last, big 16 megabytes will be stored on disk.
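The arithmetic behind those numbers, taking the ~60 KB dataset size from above as an approximation:

```python
weights_bytes = 4 * 27 * 100        # float32 * vocabulary * embedding dim = 10,800
                                    # (the 11,637 bytes reported above include overhead)
index_bytes = 16_166_831            # pre-computed first-name embeddings
dataset_bytes = 60_000              # ~60 KB of first names (approximate)

print(index_bytes / dataset_bytes)  # ~269, i.e. the indexing factor of ~260
```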

aziz

📣 I am working on a content-addressable #scheme #programming language with lineage and #i18n.

If you are up for some more:

Parentheses, peace, convos, patches, and lovers are welcome!
