Good question!
First, the actual SQL code is here if you want to pull it up and reference it: https://codeberg.org/Clew/Clew/src/branch/main/server.py#L46-L92
The main ranking function is BM25F, a modification of Okapi BM25 that scores a page's fields (title, body, and so on) with separate weights. Essentially this is a simple measure of how strongly a page contains a keyword. I do some adding and multiplying to combine the per-keyword scores into one result for the whole query.
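If it helps to see the idea in code, here is a rough Python sketch of a simplified BM25F-style score. This is not the SQL in server.py: the function name, the field names, the field weights, and the `k1`/`b` defaults are all assumptions made for illustration, and summing per-keyword scores is just the textbook way to combine them.

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights,
                doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Simplified BM25F-style score (illustrative, not Clew's actual query).

    doc_fields:    {field_name: list of tokens}
    field_weights: {field_name: weight}, hypothetical values
    doc_freq:      {term: number of documents containing the term}
    avg_len:       average weighted document length in the index
    """
    # Weighted document length, used for length normalization.
    doc_len = sum(field_weights.get(f, 1.0) * len(tokens)
                  for f, tokens in doc_fields.items())
    norm = 1 - b + b * doc_len / avg_len

    score = 0.0
    for term in query_terms:
        # Weighted term frequency: a hit in a heavy field counts for more.
        tf = sum(field_weights.get(f, 1.0) * tokens.count(term)
                 for f, tokens in doc_fields.items())
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturating term-frequency component, summed over all keywords.
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score
```

With these made-up weights, a page that has the keyword in its title scores higher than one that only mentions it in the body, and a page that does not contain the keyword at all scores zero.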
Then short pages get a small penalty (planning to adjust this) and, crucially, sites with any detected ads or trackers get a 40% penalty in the rankings. That's small enough that if a page with ads/tracking genuinely is significantly more relevant than anything else, it'll still show up first, but if there are any sites without ads/tracking that match everything, they'll be prioritized.
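To make the trade-off concrete, here is a small Python sketch of how penalties like these could be applied. Only the 40% figure comes from the description above. The short-page penalty size, the word-count cutoff, and the choice to apply both as multipliers are assumptions for illustration, not what server.py necessarily does.

```python
AD_TRACKER_PENALTY = 0.40   # from the post: 40% off for detected ads/trackers
SHORT_PAGE_PENALTY = 0.10   # hypothetical value; the post only says "small"
SHORT_PAGE_WORDS = 200      # hypothetical cutoff for what counts as "short"

def adjusted_score(relevance, has_ads_or_trackers, word_count):
    """Apply the penalties to a raw relevance score (illustrative only)."""
    score = relevance
    if has_ads_or_trackers:
        score *= 1 - AD_TRACKER_PENALTY
    if word_count < SHORT_PAGE_WORDS:
        score *= 1 - SHORT_PAGE_PENALTY
    return score
```

Under these assumptions, an ad-supported page with raw relevance 10 drops to 6 and loses to an ad-free page at 7, while an ad-supported page with raw relevance 20 drops to 12 and still comes out ahead.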
That results in pretty much any content farms or SEO-optimized junk getting ranked poorly, and independent sites coming up to the top.
Also, the crawler does most of its page discovery via RSS/Atom/JSON feeds, so sites with high-quality feeds tend to get more representation in the index in the first place. :)
Any questions?
@amin I just woke up... but I'll have some. Let me grab a ☕