@amin how do you rank pages? Do you put bloggers and writers before others' content?
Haha, fair. Actually, that reminds me of an analogy I made to @jp@lazybear.social earlier tonight about how searches were timing out for a bit after restoring a database dump:
> The database is still gonna take a while collating and sorting and vacuuming everything before stuff starts completing on time though. It's like when you just got out of bed and haven't had caffeine yet
@hyde
Good question!
First, the actual SQL code is here if you want to pull it out and reference it: https://codeberg.org/Clew/Clew/src/branch/main/server.py#L46-L92
The main ranking function is BM25F, a modification of Okapi BM25. Essentially, this is a simple measure of how strongly a page matches a keyword. I do some adding and multiplying to combine the per-keyword scores into one result for the whole query.
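For a feel of what that kind of score looks like, here is a rough Python sketch of plain Okapi BM25. It is not the SQL linked above and it is not BM25F (which additionally weights fields such as title and body); the function name `bm25_scores`, the parameters `k1=1.2` and `b=0.75`, the toy documents, and summing the per-keyword scores are all assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain Okapi BM25.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from the linked server.py.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are normalized down.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            # Assumption: per-keyword scores are simply summed.
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up), already split into lowercase tokens.
docs = [
    "rss feed reader for small blogs".split(),
    "buy cheap widgets online today".split(),
    "a blog about rss and atom feeds".split(),
]
scores = bm25_scores(["rss", "blog"], docs)
```

With these toy documents, the third one matches both keywords and scores highest, the first matches only "rss", and the second scores zero.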
Then short pages get a small penalty (planning to adjust this) and, crucially, sites with any detected ads or trackers get a 40% penalty in the rankings. That's small enough that if a page with ads/tracking genuinely is significantly more relevant than anything else, it'll still show up first, but if there are any sites without ads/tracking that match everything, they'll be prioritized.
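To make the trade-off concrete, here is a small Python sketch of how a 40% penalty plays out. It assumes the penalty is a straight multiplication of the relevance score by 0.6, which is a guess at the mechanics and not a quote from the linked SQL; `final_score` and the example scores are made up, and the short-page penalty is left out because its size isn't stated.

```python
AD_TRACKER_PENALTY = 0.40  # the 40% figure from the post

def final_score(relevance, has_ads_or_trackers):
    """Apply the ads/tracker penalty (assumed multiplicative) to a relevance score."""
    if has_ads_or_trackers:
        return relevance * (1 - AD_TRACKER_PENALTY)
    return relevance

# (name, relevance, has ads/trackers); hypothetical numbers.
pages = [
    ("clean-blog", 7.0, False),
    ("tracked-but-close", 10.0, True),
    ("tracked-and-far-better", 12.0, True),
]
ranked = sorted(pages, key=lambda p: final_score(p[1], p[2]), reverse=True)
ranking = [name for name, _, _ in ranked]
```

Under that assumption, a page with ads or trackers needs roughly 1.67 times the relevance of a clean page (1 / 0.6) to outrank it, so the page scoring 12 still beats the clean page at 7, while the one scoring 10 drops below it.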
The result is that pretty much any content farm or SEO junk gets ranked poorly, and independent sites come up to the top.
Also, the crawler does most of its page discovery via RSS/Atom/JSON feeds, so sites with high-quality feeds tend to get more representation in the index in the first place. :)
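As a rough idea of what feed-based discovery involves, here is a minimal Python sketch that pulls page URLs out of an RSS 2.0, Atom, or JSON Feed document using only the standard library. This is not the crawler's actual code; `links_from_feed` and the sample feeds are hypothetical, and a real crawler would also need fetching, error handling, and deduplication.

```python
import json
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def links_from_feed(text):
    """Return the page URLs listed in an RSS 2.0, Atom, or JSON Feed document."""
    text = text.strip()
    if text.startswith("{"):
        # JSON Feed: each item carries its page URL in "url".
        items = json.loads(text).get("items", [])
        return [item["url"] for item in items if "url" in item]
    root = ET.fromstring(text)
    links = []
    if root.tag == ATOM_NS + "feed":
        # Atom: <entry><link rel="alternate" href="..."/>; rel defaults to alternate.
        for entry in root.iter(ATOM_NS + "entry"):
            for link in entry.findall(ATOM_NS + "link"):
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    links.append(link.get("href"))
        return links
    # RSS 2.0: <rss><channel><item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links

# Made-up sample feeds using example.org URLs.
rss_sample = """<rss version="2.0"><channel><title>Example</title>
<item><title>One</title><link>https://example.org/one</link></item>
<item><title>Two</title><link>https://example.org/two</link></item>
</channel></rss>"""
json_sample = '{"version": "https://jsonfeed.org/version/1.1", "items": [{"id": "1", "url": "https://example.org/three"}]}'

rss_links = links_from_feed(rss_sample)
json_links = links_from_feed(json_sample)
```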
Any questions?