@ionizedgirl @aires As someone running a small web hosting company, yes, you can. There are at least three organizations aggressively crawling the web to build an index: TikTok (ByteDance), Amazon, and {not-sure-yet}. They are ignoring the rate limits set in robots.txt and generating ridiculous server loads. [Except for ByteDance, because after they masked their user agent I blocked every damn data centre they're running on at the firewall.] Amazon's crawler is pathologically bad, too.
@alan @ionizedgirl @aires Are they actually building an index, or are they scraping content for AI training datasets? Given they seem to ignore sitemaps, I’m inclined to think the latter.