⚠️ The Fediverse has been scraped, again ⚠️

Almost six million posts from 363 instances have been scraped.

"All the posts with public visibility published by users hosted on Mastodon servers [...] which support the English language" have been scraped along with their metadata, and the "policy, the code of conduct and the prohibited contents of each instance".

The dataset is an attempt at creating an open dataset for "research" into algorithms like the ones Facebook uses to identify problematic content, based around users' use of Content Warnings.

The dataset can be found here:

It was created at the University of Milan, Italy, apparently for the 13th International AAAI Conference on Web and Social Media (ICWSM):

The associated publication:
aaai.org/ojs/index.php/ICWSM/a or likeable.space/media/30ae595a1 or DM me for a copy.

Related dataset:

Original post:
likeable.space/objects/98fe744 @tastytea


"We are able to collect a large amount of data containing different kinds of content produced by individuals from around the world. This fact rises some considerations about the privacy of the Mastodon users, that must be taken into account. In particular, the JSON response about a toot contains plenty of information about the user who has published the post. Since the Mastodon user may be unaware of their data being public and reusable for research purposes we disposed of the information about the users and we fully anonymized them by hashing the Mastodon user identifier. This latter aspect might limit the reuse and the fusion of this dataset with the topology of the Mastodon social network. To overcome these limitations and to integrate our dataset with the dataset about the Mastodon social network we previously released, we re-hashed the node/user identifier in the latter dataset, so that the same individual in both datasets corresponds to the same hash code."

"The gathering and the usage of public available data is never explicitly mentioned, consequently our data collection seems to be complaint with the policy of the instance. Moreover if the server of an instance is in the EU or the EEA we also fulfill the requirements of the GDPR since we do not store and release personally identifiable information of the users. Finally, we have also respected the limitations imposed by the robots.txt files of the different instances."

@puffinus_puffinus They store the URI of each toot, which contains the username, so it's not anonymized at all.

This is ridiculous and very worrying. It is never mentioned... so instead of asking they assumed they are allowed?

@puffinus_puffinus Um, their anonymization is worthless, since the original URI is present in each of them. This is basically a dump of the JSON that goes to each server.

I did a search on my id and got several toots.
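
To make the point in the replies above concrete, here is a rough sketch of why hashing the user identifier does not help when the toot URI is kept. The record layout (`author_hash`, `uri`), the instance name and the account name are made up for illustration and are not taken from the actual dataset; the URI shape `/users/<name>/statuses/<id>` is the usual Mastodon one, but other server software may differ.

```python
import hashlib
from urllib.parse import urlparse


def pseudonymize(user_id: str) -> str:
    """Hash a user identifier (hypothetical stand-in for the paper's hashing step)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def recover_account(toot_uri: str):
    """Read the account back out of a toot URI of the form
    https://<instance>/users/<name>/statuses/<id>.
    Returns None if the URI does not have that shape."""
    parsed = urlparse(toot_uri)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "users":
        return f"@{parts[1]}@{parsed.netloc}"
    return None


# Hypothetical record: the user ID is hashed, but the original URI is still there.
record = {
    "author_hash": pseudonymize("12345"),
    "uri": "https://mastodon.example/users/alice/statuses/1001",
}

# The hash is irrelevant; the URI alone names the author.
print(recover_account(record["uri"]))
```

No hash needs to be reversed for this to work. As long as the URI is stored next to the hashed identifier, anyone can search the dump for their own (or someone else's) account name, which is what the last reply describes.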

