so I checked the Washington Post's article about Google's C4 dataset and used their tool to search through it, and I gotta say, Google started scraping Mastodon instances too. I don't know how they do it, but they did. they even have their teeth in vulpine.club and meow.social
granted, it accounts for very little data: less than 1% of all tokens.
that being said. wow. fuck.
to opt out, in Mastodon: Preferences -> Other -> "Opt-out of search engine indexing"
turn this on
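
if you want to double-check that the toggle took effect, here's a rough sketch. big assumption on my part: that the setting makes your instance put a `robots` meta tag containing `noindex` on your public profile page. I haven't verified that for every Mastodon version. `RobotsMetaParser` and `has_noindex` are names I made up for this, and you'd have to fetch your profile HTML yourself and pass it in.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values keep their case.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noindex(html: str) -> bool:
    """True if the page asks crawlers not to index it (assumed opt-out signal)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# hypothetical snippet of a profile page, not real instance output
sample = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(sample))
```

keep in mind `noindex` is only a request. a crawler that doesn't care can ignore it.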