⚠️ The Fediverse has been scraped, again ⚠️

Almost six million posts from 363 instances have been scraped.

"All the posts with public visibility published by users hosted on Mastodon servers [...] which support the English language" have been scraped along with their metadata, and the "policy, the code of conduct and the prohibited contents of each instance".

It is an attempt at creating an open dataset for "research" into algorithms like the ones Facebook uses to identify problematic content, built around users' use of Content Warnings.

The dataset can be found here:

It was created by the University of Milan, Italy, apparently for the 13th AAAI:

The associated publication:
aaai.org/ojs/index.php/ICWSM/a or likeable.space/media/30ae595a1 or DM me for a copy.

Related dataset:

Original post:
likeable.space/objects/98fe744 @tastytea


To see if your instance was scraped, check here:


Originally posted by cursed.technology/@tao/1034727 @tao@cursed.technology

@puffinus_puffinus @tao So could you stop them legally by having the instance license every toot under terms that this kind of scraping would violate? Technology-wise, if something is publicly broadcast, this will happen.

@frickhaditcoming @puffinus_puffinus @tao
Adding some specific legal text would probably make it illegal, if it isn't already. But it would take a lawyer to figure out exactly how...

And while scraping might still be technically possible, I think making it illegal would already be a big step. It would mean "researchers" couldn't just publish a study based on it. Or that law enforcement couldn't use it in a court of law. Or that it would be a risk for a very public business to do it.
