5 comments
~/hyde

@amin how do you rank pages? Do you put bloggers and writers before others' content?

Amin Hollon 🇺🇸🇲🇾🇮🇳🇦🇫

@hyde

Good question!

First, the actual SQL code is here if you want to pull it up and reference it: https://codeberg.org/Clew/Clew/src/branch/main/server.py#L46-L92

The main ranking function is BM25F, a modification of Okapi BM25. Essentially this is an easy measure of how strongly a page contains a keyword. I do some adding and multiplying to get a result for all keywords entered in.
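
For anyone who wants to see the shape of that calculation, here is a minimal Python sketch of a simple BM25F variant. It is not the code from server.py: the field names, the weights in `FIELD_WEIGHTS`, the `K1` and `B` constants, and the choice to sum per-keyword scores are all assumptions made for illustration.

```python
import math

# Hypothetical field weights; the real values are not stated in the post.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}
K1 = 1.2   # term-frequency saturation (a common default, assumed here)
B = 0.75   # length-normalization strength (a common default, assumed here)


def bm25f_term(term, doc, avg_len, n_docs, doc_freq):
    """Score one term against one document with a simple BM25F variant."""
    weighted_tf = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, [])
        tf = tokens.count(term)
        if tf == 0:
            continue
        # Normalize each field's term frequency by its relative length.
        norm = 1.0 - B + B * len(tokens) / avg_len[field]
        weighted_tf += weight * tf / norm
    # Okapi-style IDF; the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturate so that repeating a keyword has diminishing returns.
    return idf * weighted_tf / (K1 + weighted_tf)


def bm25f_query(terms, doc, avg_len, n_docs, doc_freqs):
    """Combine keywords by summing per-term scores (one common choice)."""
    return sum(
        bm25f_term(t, doc, avg_len, n_docs, doc_freqs.get(t, 0)) for t in terms
    )
```

With these assumed weights, a keyword in the title counts for more than the same keyword in the body, and a page that contains none of the keywords scores zero.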

Then short pages get a small penalty (planning to adjust this) and, crucially, sites with any detected ads or trackers get a 40% penalty in the rankings. That's small enough that if a page with ads/tracking genuinely is significantly more relevant than anything else, it'll still show up first, but if there are any sites without ads/tracking that match everything, they'll be prioritized.

That results in pretty much any content farms or SEO-optimized junk getting ranked poorly, and independent sites coming up to the top.
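
To make the arithmetic concrete, here is a rough Python sketch of how such penalties could be applied. It assumes the penalties are multiplicative, which the post does not confirm. Only the 40% figure comes from the post; `SHORT_PAGE_PENALTY`, `SHORT_PAGE_WORDS`, and the function names are hypothetical.

```python
AD_TRACKER_PENALTY = 0.40  # from the post: 40% penalty for detected ads/trackers
SHORT_PAGE_PENALTY = 0.10  # hypothetical; the post only says "small"
SHORT_PAGE_WORDS = 200     # hypothetical cutoff for what counts as "short"


def adjusted_score(relevance, word_count, has_ads_or_trackers):
    """Apply the penalties as multipliers on the raw relevance score."""
    score = relevance
    if word_count < SHORT_PAGE_WORDS:
        score *= 1.0 - SHORT_PAGE_PENALTY
    if has_ads_or_trackers:
        score *= 1.0 - AD_TRACKER_PENALTY
    return score


def ad_page_outranks(ad_relevance, clean_relevance):
    """True if a long page with ads still beats a long page without them."""
    return adjusted_score(ad_relevance, 1000, True) > adjusted_score(
        clean_relevance, 1000, False
    )
```

Under that assumption, a page with ads or trackers keeps 60% of its score, so it needs a raw relevance more than 1/0.6 (about 1.67 times) that of a clean page to rank above it.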

Also, the crawler does most of its page discovery via RSS/Atom/JSON feeds, so sites with high-quality feeds tend to get more representation in the index in the first place. :)
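
As an illustration of feed-based discovery, the sketch below pulls page URLs out of an RSS 2.0, Atom, or JSON Feed document using only the Python standard library. It is an assumption about what such a step could look like, not the crawler's actual code, and `links_from_feed` is a hypothetical name.

```python
import json
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def links_from_feed(text):
    """Return page URLs listed in an RSS 2.0, Atom, or JSON Feed document."""
    stripped = text.lstrip()
    if stripped.startswith("{"):
        # JSON Feed: each item may carry a "url" field.
        items = json.loads(stripped).get("items", [])
        return [item["url"] for item in items if "url" in item]

    root = ET.fromstring(stripped)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom: <entry><link href="URL"/></entry>; a missing rel means "alternate".
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
    return links
```

A real crawler would also need to fetch the feed, handle malformed XML, and deduplicate URLs, none of which is shown here.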

Any questions?

~/hyde

@amin I just woke up... But I'll have some, let me grab a ☕

Amin Hollon 🇺🇸🇲🇾🇮🇳🇦🇫

@hyde

Haha fair

Actually reminds me of an analogy I made to @jp@lazybear.social earlier tonight about how searches were timing out for a bit after restoring a database dump:

> The database is still gonna take a while collating and sorting and vacuuming everything before stuff starts completing on time though. It's like when you just got out of bed and haven't had caffeine yet
