⚠️ The Fediverse has been scraped, again ⚠️

Almost six million posts from 363 instances have been scraped.

"All the posts with public visibility published by users hosted on Mastodon servers [...] which support the English language" have been scraped along with their metadata, and the "policy, the code of conduct and the prohibited contents of each instance".

The dataset is an attempt at creating an open dataset for "research" into algorithms like the ones Facebook uses to identify problematic content, based around users' use of Content Warnings.

The dataset can be found here:

It was created by the University of Milan, Italy, apparently for the 13th AAAI:

The associated publishing: or or DM me for a copy.

Related dataset:

Original post: @tastytea

"We are able to collect a large amount of data containing dif-ferent kinds of content produced by individuals from around the world. This fact rises some considerations about the privacy of the Mastodon users, that must be taken into account. In particular, the JSON response about a toot contains plenty of information about the user who has published the post. Since the Mastodon user may be unaware of their data being public and reusable for research purposes we disposed of the information about the users and we fully anonymized them by hashing the Mastodon user identifier. This latter aspect might limit the reuse and the fusion of this dataset with the topology of the Mastodon social network. To overcome these limitations and to integrate our dataset with the dataset about the Mastodon social network we previously released, we re-hashed the node/user identifier in the latter dataset, so that the same individual in both datasets corresponds to the same hash code."

"The gathering and the usage of public available data is never explicitly mentioned, consequently our data collection seems to be complaint with the policy of the instance. Moreover if the server of an instance is in the EU or the EEA we also fulfill the requirements of the GDPR since we do not store and release personally identifiable information of the users. Finally, we have also respected the limitations imposed by the robots.txt files of the different instances."

I don't know enough about privacy and online data laws to say, but is there anything that instances can do? What is public is not necessarily for the public eye. I am assuming people will take issue with this dataset including their toots, anonymised or not. It's likely also a personal security issue, because a few large json files that include millions of toots is probably a useful resource to police forces.

The irony of course is that the "inappropriate" toots that the study focuses on are just toots that use a Content Warning. So, they throw jokes that use the reveal as a punchline, and posts containing eye contact, in with the same toots they suspect are problematic and breach the code of conduct of instances.

The premise is incoherent! Problematic toots are taken down by the admins. The Content Warned toots that were scraped are permitted! A Content Warning does not mean the toot has been identified as problematic or a breach of the instance's code of conduct by the user! The dataset seems to be of little use for what I understand to be its intended purpose.

Also, what the fuck is with .social's Content Warning word cloud? They processed the words used in content warnings to bring them back to root definitions and came up with cauliflower???

So quite a few people interacted with me with regard to the Milan Fediverse scraping. It would be good if we could write a complaint and request for the data to be taken down, as well as highlighting its uselessness for the intended purpose? And find out if there are legal issues around scraping data in this way that we can use to our advantage.

@puffinus_puffinus they store the URI of each toot, containing the username, therefore it's not anonymized at all

This is ridiculous and very worrying. It is never mentioned... so instead of asking they assumed they are allowed?

@puffinus_puffinus Um, their anonymizing is worthless since the original URI is present in each of them. This is basically a dump of the JSON that goes to each server.

I did a search on my id and got several toots.
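To make the point about the URI concrete, here is a minimal Python sketch. It assumes toot URIs look like `https://<host>/users/<name>/statuses/<id>` or `https://<host>/@<name>/<id>`; those shapes, the `handle_from_uri` helper, and the sample record are illustrative assumptions, not taken from the dataset itself. If a record keeps a URI of that kind, hashing the account id does not hide who posted it.

```python
import re
from urllib.parse import urlparse

# Assumed URI shapes (illustrative, not verified against the dataset):
#   https://example.social/users/alice/statuses/123
#   https://example.social/@alice/123
_PATTERNS = [
    re.compile(r"^/users/(?P<user>[^/]+)/statuses/\d+"),
    re.compile(r"^/@(?P<user>[^/]+)/\d+"),
]


def handle_from_uri(uri):
    """Return 'user@host' if the URI matches an assumed shape, else None."""
    parsed = urlparse(uri)
    for pattern in _PATTERNS:
        match = pattern.match(parsed.path)
        if match and parsed.hostname:
            return f"{match.group('user')}@{parsed.hostname}"
    return None


# Hypothetical record: the account id is hashed, but the URI is left intact.
record = {
    "account": "5f4dcc3b5aa765d61d8327deb882cf99",
    "uri": "https://example.social/users/alice/statuses/123",
}

# The handle falls straight out of the URI, regardless of the hash.
recovered = handle_from_uri(record["uri"])
```

Searching the files for your own username, as described above, is the same check done by hand.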

@puffinus_puffinus Surely this is a mass copyright violation? I’m tempted to send a takedown notice.

@qyliss I think it is needed. The Fediverse needs to respond in some way. There are emails you can message the creators with in the pdf.

Time to sleep here but ran across this and I'd like to ask you about the concept of the right of "big data" property. I mean, individual public data is one thing, with its own privacy policies, but mass data exploitation is something completely different, and very powerful. Do you know any texts around this idea? I'd like to explore it as I've been thinking about that for a while now.

Thank you for the great toots and inspiration, both of you!

@OviOne @puffinus_puffinus @qyliss There's a journal called Big Data and Society that publishes some interesting articles on such topics.

@pizza_pal @OviOne @puffinus_puffinus @qyliss Also danah boyd, Shoshana Zuboff, and Rob Kitchin are some authors whose arguments I find to be valuable.

I'm opposed to an actor scraping the entire sphere, for purposes of an analysis that will not be fully returned, mirror-fashion, to the communities whose behaviour traces have been systematically syphoned off . .
@puffinus_puffinus @qyliss

. . by an industrial (military?) strength machine which is not in any way equivalent to the ordinary 'public' access of actual persons to actions of other persons-in-public. In a world with bots (and other asymmetrical real world extraction of public traces by un-public agencies) some defence against this kind of violation of social norms is needed.
This is not 'privacy'. This is society-destroying power.
Nothing extracted, that is not returned to source.
@puffinus_puffinus @qyliss

How are terms of service supposed to actually inhibit this kind of practice? Who's gonna sue? Is the fediverse really going to take a violator to court?
Seems to me, this is outside the law. I guess it's bot wars?
@puffinus_puffinus @qyliss

That is pretty accurately my point of view too. It's a war about power when it comes to the control of that next level data, the "big data". Whoever gets it and exploits it gains power. Maybe the first step is being conscious about our right as a community to keep that data exploitation to ourselves.

It seems to me like a really complex social/technical problem. And yeah, I don't see a solution through laws.
@puffinus_puffinus @qyliss

> I don't see a solution through laws
I'm puzzled by the way the fediverse anarcho-libertarian culture seems willing to accept property law (licensing, terms of service, etc) as a possible solution. Puzzled too, to find myself closer to the bomb-throwing (main force) stream of anarchism when it comes to asserting the force of the commons vs abusers. No explosives involved :)
I do feel stewarding of commons - not property not ownership - is the frame.
@puffinus_puffinus @qyliss

it is public data, i mean anyone even facebook can just scrape and analyze it
they just made the fact that they do it public


The point is not that it can be done anyway. Sure, I already know that. The point is that it has been done publicly.

@puffinus_puffinus I'm also concerned these students have created and published a framework of tools (as well as the dataset) that cops/feds (or anyone else) could make use of to monitor the fediverse.

Maybe not in Europe as there isn't even anything /that/ controversial here but perhaps in USA and some other more authoritarian countries.

Especially with a large amount of traffic from sex workers indexed where they may be operating in a legal grey area..

@aidalgol @puffinus_puffinus

TBH I wonder if these techniques have /already/ been used in some parts of the World to discourage use of the Fediverse for activism (whilst maybe even tolerating cat pictures, shitposts etc), in places where English or other European languages aren't as widely used and tech workers are dependent on govt/big companies for employment and there is a smaller dataset to go through... >>

@aidalgol @puffinus_puffinus I know this is "tinfoil hat" type theory but I never use language filters and still occasionally browse federated timelines and it's interesting which languages *aren't* there as well as which ones are (considering that Mastodon has been Unicode clean for about 2 years so it's possible to write in just about any World language on here)

@vfrmedia @puffinus_puffinus Nothing tinfoil hat about anything you said. These are very real threats.

@vfrmedia @aidalgol That's a good point actually. I have put it down to this kind of technology being a more Western thing (America, Europe, Japan and so on) but it's possible there's more to it.

@puffinus_puffinus @aidalgol I think access to tech, gadgets, cars, consumer goods is much more widely available to middle class tech workers in "non Western" countries these days (I was lurking on an Indian motoring forum long before many Indians started to use the Fediverse), but once people have these they are less likely to push back against their society - especially if a fairly comfortable lifestyle could be quickly taken from them if they choose "the wrong path"..

@vfrmedia @puffinus_puffinus

you can be pretty sure they're already doing it in europe

the french government passed a law recently allowing them to scrape all social media posts from french citizens, supposedly as a way to fight against tax fraud, but worded in such a way that they can do pretty much anything with the data

(and usually when such a law is passed in france it's to retroactively protect databases they've been building and operating illegally for years)

@vfrmedia I'm pretty sure cops/feds/nsa/<your own stasi> already have the same tools.

(cc @puffinus_puffinus)

they do, but not always the time/budgets to deploy them.

This is a ready made dataset of a million toots for them to use that has made this use much easier.

BTW in the UK it's the BBC (yes, our state broadcaster!) which does the social media mining!


@puffinus_puffinus there's a broad consensus here to CW food to avoid triggering eating disorders as well as "cursed food" toots/posts (normally CW'ed), trying to process CW's (and maybe even the content of toots?) in this manner is likely to turn up a lot of food items (consider that a /lot/ of people toot about food at least daily)

@vfrmedia I guess. I'm interested (in an angry sort of way) to see how their processing found the root of food content to be "cauliflower" though

@puffinus_puffinus I'm curious too, but I don't have the tools (or skills) to process that JSON (even looking through one bit of it on Notepad++ lags up my computer due to its size) and I'm not sure if the methodology to get *all* the data in the paper (other than the raw data) has been published

@vfrmedia @puffinus_puffinus Oh is *that* why people CW food. I've always wondered about that.

@puffinus_puffinus Probably some English to Italian to English translation problem.

@puffinus_puffinus they seem to have fundamentally misunderstood what content warnings are for.

@puffinus_puffinus the cauliflower one was just from me talking about my dick.

(scrape that, you clowns.)

@anarchiv thanks Souv, always good to see some funny nonsense in my replies XD (this toot brought to you by the "helpful" reply guys in me menchies atm)

@puffinus_puffinus I love how somehow Goldeen is on there?! A fucking Pokémon. 😅 😂 🤣

@puffinus_puffinus Cauliflower is the most triggering vegetable of all. This is known.

@puffinus_puffinus I'm just glad to see my avant garde music projects "Greenhouse Cauliflower" and "inaudible cheeseSteak" are getting the attention they deserve

I feel like privacy concerns aside, these people should lose their funding bc they just suck at research. How are you going to purport to draw conclusions based on a dataset you don't even understand? It's like you started w the premise that only high status individuals wear hats so you end up w a dataset that classifies everyone who wears a baseball cap as upper class.

@puffinus_puffinus If we can mark instances as not wanting to be indexed by search engines (but still having to hope that they respect it, which I doubt that they do), we should at least be able to signal that the content is not meant for use off-instance
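For reference, the existing opt-out signal works roughly like the Python sketch below. It uses the standard library's `urllib.robotparser` to show how a well-behaved crawler would read a robots.txt that disallows everything. The robots.txt content, the `research-bot` user agent, and the example URL are made up for illustration, and nothing here forces a scraper to run this check at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an instance might serve to refuse all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler would see that this toot URL is off limits.
allowed = may_fetch(ROBOTS_TXT, "research-bot", "https://example.social/@alice/123")
```

A "not for use off-instance" flag would be voluntary in the same way: it states intent, and compliance is up to whoever is scraping.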

@david I wonder if it's worth adding a clause like this to the Code of Conduct...

@puffinus_puffinus I mean they wouldn't be legally binding in any way, but could signal intentions at least?

@puffinus_puffinus It's especially so because as we said in another thread, this list is almost entirely left-wing instances. There's only about a half dozen "free speech" instances on there and the major ones (Gab, Spinster, Kiwi Farms, Librem, Free Speech Extremist, etc.) aren't there.

So it's not just a good resource for law enforcement, it's a literal honeypot for any dictatorship like the USA or China looking to shut down dissent.

@puffinus_puffinus @tao so could you stop them legally by adding something where the instance licenses every toot under terms that this would violate? Technology wise, if something is publicly broadcast this will happen

@frickhaditcoming @puffinus_puffinus @tao
Probably adding some specific legal text would make it illegal, if it isn't already. But it would take a lawyer to figure out how exactly...

And while it might still be possible, I think being illegal is already a big step. This would mean "researchers" couldn't just publish a study by doing it. Or that law enforcement couldn't use this in a court of law. Or that it would be a risk for a very public business to do it.

@puffinus_puffinus @tastytea
This is, simply put, anti-ethical.

Since it is supposedly a scientific study, I would suggest contacting the review board of the university (or something like that).

Participation in scientific studies is not something trivial. Usually it is necessary to get a signed form with free and informed consent; implied consent should not be acceptable. And both the allowed uses and the handling of the data are very restricted.

