Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)

msp26

Fantastic. I wonder how much random technical info is buried in these servers. I hate what it's done to game modding.

ldoughty

I think the average server size here is in the ballpark of 1200 people.

These are servers that asked to be advertised by Discord ("Discovery"). These are unlikely to be any kind of servers used for private or even semi-private discussions. You likely don't know most of the people on the server.

Most likely, the 'hottest' kind of data you might find is someone accidentally leaking info akin to the World of Tanks forum post 'corrections'.

giancarlostoro

A fair number of those servers have tens of thousands, if not hundreds of thousands of members. I admin two with over 50 thousand members, both listed in Discord's Server Explorer.

Davidzheng

The algebraic topology server probably contains a huge number of treasures in modern research algebraic topology. I really really hope it's archived in full

DaSHacka

It's not difficult to archive it yourself, if you really care [0].

I use a dedicated alt account to archive tons of various servers I'm in, and auto-download all attachments. It's nice having regex search capabilities on my local copy of the data too.

[0] https://github.com/Tyrrrz/DiscordChatExporter
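
To illustrate the local regex search described above, here is a minimal Python sketch. It assumes the archive was exported as JSON with a top-level `messages` list whose entries carry `timestamp`, `author.name`, and `content` fields; that layout is an assumption about the export, and `load_export` and `search_messages` are hypothetical helper names, not part of DiscordChatExporter.

```python
import json
import re
from pathlib import Path


def load_export(path):
    """Load one exported channel file and return its list of message dicts."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumed layout: {"messages": [{"timestamp": ..., "author": {"name": ...}, "content": ...}]}
    return data.get("messages", [])


def search_messages(messages, pattern):
    """Return (timestamp, author name, content) for each message whose content matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for msg in messages:
        content = msg.get("content") or ""  # tolerate missing or null content
        if regex.search(content):
            author = (msg.get("author") or {}).get("name")
            hits.append((msg.get("timestamp"), author, content))
    return hits
```

Something like `search_messages(load_export("channel.json"), r"segfault|stack trace")` would then list every matching message in one exported channel (the file name is a placeholder).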

judge2020

Using a user account to do this is still considered risky since any automated API usage by a non-bot user is against TOS, and they have heuristics (maybe now ML-based heuristics) for banning accounts for 'things that "don't look like what our official client does"'[0].

0: https://news.ycombinator.com/item?id=25215415

Macha

It seems they identified servers via the discovery feature, which servers need to opt into (and I think be recognised as a "community server"? Though that might be out of date). I guess this is better than just scanning the web for invite links, but it does mean that probably most of those game modding servers were not included.

nixpulvis

I learned programming back in the day on the Tukui (a wow addon) forums. I hate that it's all discord now. Not well searchable and buried info.

hiccuphippo

I wonder if LLM companies don't have ways to scrape private Discord servers already. Creating accounts and pulling all the historic data doesn't sound impossible.

chneu

They absolutely can and are. Multiple posts in here discuss how to do it.

It's like back in the days of IRC. People just logged all of it.

strogonoff

Game modding is profitable and people doing it professionally (which they increasingly do) are quite attuned to the fact that making it too accessible would decimate their revenue. As a result, you either pay for the mod (early access, extra content, etc.), or you pay to join some Discord, but ideally you pay for something. Discord, which I generally dislike, is not necessarily the cause of it; if there was no Discord, people would probably use some other closed community platform instead.

I expect this would become more widespread as more traditional jobs are subsumed by unregulated ML tech (which, incidentally, the incumbent job-holders are helping train) and more people turn to what used to be generally a hobby as their means of making a living (not that that would last for too long either).

squigz

> Game modding is profitable

It can be. As I understand it, it's sort of like streaming or other content creation - yes, it's possible, but difficult, as it's a saturated market. Most mod authors don't make much money.

As a slight aside, I think people would be more inclined to support creators like mod authors if it were simply easier. Patreon and the like make it fairly easy, but I don't think many people want to subscribe to 20+ Patreons for $5 apiece, as much as they might like to support those authors. On the other hand, I think more people would be willing to pledge $X per month to be split among all of their subscriptions. Sure, most creators would only get a few cents per user, but they'd likely get many more people subscribing, and I think it would add up quick. I might be wrong, and I don't take credit for this idea by any means; I read it some time ago, and possibly Patreon even offered this system before?
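
A quick worked example of the pooled-pledge idea, as a Python sketch. The equal split is an assumption (the comment does not say how the pool would be divided), and `split_pledge` is a hypothetical name. Amounts are kept in integer cents so nothing is lost to rounding.

```python
def split_pledge(pledge_cents, creators):
    """Split one monthly pledge equally among creators, in whole cents."""
    if not creators:
        raise ValueError("need at least one creator to split a pledge")
    share, remainder = divmod(pledge_cents, len(creators))
    # Hand the leftover cents to the first few creators so the shares sum to the pledge.
    return {
        name: share + (1 if index < remainder else 0)
        for index, name in enumerate(creators)
    }
```

With a $10 pledge (1000 cents) and 20 creators, each gets 50 cents a month from that one user, versus the $100 a month it would cost to back all 20 at $5 apiece.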

Voxany

[dead]

roskelld

I don't know if Discord fixed it as I haven't checked in a few years, but I tinkered with scraping some public Discords and I found that I could see hidden channels, not the data, but the channel names, which could do things like reveal to me if the same Discord was used for in-house development if it was a product Discord. Not great.

tuetuopay

You can still see them. Using alternate clients you will see them, and bots also see them.

0xC0ncord

This is still the case. There are even some client mods that let you view hidden channel names and know what roles/permissions are required to participate in them.

judge2020

This is technically the case - I believe the existence of private channels is still sent to the client (eg. their snowflake IDs, which also reveal creation date) but the channel names are no longer sent as well.
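
For context on why a snowflake ID alone reveals a creation date: the upper bits of a Discord snowflake hold the number of milliseconds since 2015-01-01 UTC, so shifting the ID right by 22 bits recovers the timestamp. A minimal Python sketch (`snowflake_to_datetime` is a hypothetical helper name):

```python
from datetime import datetime, timedelta, timezone

# Discord's epoch: the first second of 2015, UTC.
DISCORD_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake):
    """Return the UTC creation time encoded in a Discord snowflake ID."""
    # The low 22 bits hold worker, process, and sequence fields; drop them.
    ms_since_epoch = int(snowflake) >> 22
    return DISCORD_EPOCH + timedelta(milliseconds=ms_since_epoch)
```

Calling `snowflake_to_datetime(channel_id)` on any channel, message, or user ID yields its creation time, which is why a leaked ID carries information even when the name is withheld.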

kd5bjo

A quick read through of their anonymization process seems to indicate that they didn’t scan the message contents for PII (other than usernames).

If true, that seems like a huge oversight. I also wonder what would happen if someone finds their information in the dataset and requests it to be removed per GDPR or other privacy legislation.
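
As a rough idea of what a content-level PII pass could look like, here is a Python sketch that redacts email addresses and phone-like digit runs. This is not the dataset authors' method; the patterns are deliberately simple assumptions that will miss many real cases (names, addresses, handles), and `redact_pii` is a hypothetical helper name.

```python
import re

# Simplistic patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Pattern matching of this kind only catches structured identifiers; free-text details such as a real name or a school would need a different approach.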

bawolff

I can't help but think that if you say something in a public forum you should implicitly give up the right to privacy.

E.g. if someone scraped hackernews and made a dataset containing this comment, i don't think i should have any right to complain.

jowea

I understand wanting to be careful, but didn't they only grab messages from servers that are already very public? Are Twitter message datasets anonymized?

Cynddl

That's not how GDPR works, and in this case the data is clearly not anonymised despite the authors' claims. Among other things, there need to be mechanisms for users to delete their data, whether it was at some point public or not.

ronsor

The authors can presumably update the dataset on the site; however, I think past versions remain. Besides that, the GDPR is at odds with the fact that public posts and data almost never go away. I don't think that reality can be legislated away, try as politicians might.

In all honesty, it's better to reserve the effectiveness for private, personal data, for the sake of practicality.

jowea

Yeah there probably is some GDPR implication somewhere, I wasn't speaking on the legal aspects.

leotravis10

cflewis

As usual, 404 nails it:

----

It should be noted, however, that almost no one reads end-user license agreements and many of Discord’s users are children and teenagers. Discord is, first and foremost, a platform for gamers to organize communities and it’s not plausible that a 15 year old looking for a Fortnite meme server ever thought their dumb jokes about Tomato Town would end up in a public database five years later.

----

Same as other commenters here: I think this is shameful action under the guise of research and I cannot fathom why any IRB board would approve this (and perhaps it did not in this case, I do not know if Brazil has such a thing).

Back in the day (15ish years ago), I wrote a paper where I scraped the World of Warcraft API. It wasn't hard to do, I started on a realm, looked for arena teams, then went to guilds and got character sheets from there. I took the opinion that if Blizzard doesn't throttle me it's fair game.

Looking back now, I think that to have been pretty naive. I wouldn't say reckless, but definitely naive. In my mind, I had not made a delineation between "I can access this thing manually one at a time" and "I can access all of it automatically". As far as I was concerned, it was just the computer pressing the buttons. It was the same thing.

I think in the fullness of time we have collectively come to realize it is 100% not the same thing. The _availability_ of a thing and the _collection_ of a thing are two different issues with their own thorny problems. The researchers here have made the same mistake I did, but instead of it just being what gear your character was wearing, they took actual communications instead.

I hope this paper gets retracted, all data deleted and a sincere apology offered.

lolinder

On the contrary, I think that what these researchers did was the only ethical thing to do once they discovered that this was possible.

There's no way that this hasn't been done dozens of times before by intelligence agencies, hacker groups, and whoever else you care to worry about. Most of us here were well aware that public Discord channels have always been public and durable. It's hardly a secret from the technically savvy, it's just that Discord doesn't make it clear enough to regular users.

All this paper changes is that it draws mainstream attention to what was already happening illicitly for as long as Discord has been around. This can only be a good thing: the children and teenagers 404 is so worried about have always been vulnerable to their data getting leaked just like this, it's just that up until now that's been happening in the dark so as not to kill the golden goose.

NoahZuniga

A while back there was a site that allowed you, for payment, to look up all public chat messages of a Discord user. Clearly this database exists, and if criminals or government agencies want to get their hands on it, they can.

cflewis

I think conflating a security paper which shows something is possible with using the "exploit" to create a database 100s of GBs large and analyze it is disingenuous at best.

AStonesThrow

It says they used ethical anonymization, but we’ve seen other scrapers are always completely in violation of Discord’s TOS.

So did Discord cooperate, or give special authorization for this collection? It wouldn’t appear that they could do so, if privacy belongs to their users at all.

01HNNWZ0MV43FF

Would the TOS even prevent something like joining a guild, downloading all messages, then leaving?

judge2020

User bots (including hacked clients) are officially banned by the TOS, which addresses that concern.

The only acceptable API usage is via bots that server owners choose to invite. And while it might be legally OK (if the bot's own TOS says it), I promise no server owner is expecting an invited bot to slurp up every message for use in a data set, whether that be for academic purposes or a potential stalking/"dirt" database.

I highly doubt this is the most ethical instance of data collection.

smileybarry

IIRC data slurping (for exporting) is also not allowed bot usage.

> B. API Data Sharing & Retention

> You will not share API Data with any third party, except in the following circumstances, subject to compliance with the Terms and applicable laws and regulations: (i) with a Service Provider; (ii) to the extent required under applicable laws or regulations; and (iii) when a user of your Application expressly directs you to share their API Data with the third party (and you will provide us proof thereof upon request).

https://support-dev.discord.com/hc/en-us/articles/8562894815...

AStonesThrow

I'm not sure what you mean by "prevent". A TOS is a legal document designed to put down rules and a legal basis for the service.

I don't know what a "guild" is, if it's some Discord thing, and you don't say whether this is a good-faith human who joins, or a bot operator intending to scrape. The hypothetical is irrelevant here; what is germane is the expectation of privacy of the individual participants, and the terms which bind people who use that service.

The TOS clearly didn't prevent the use of the API, but it may indeed prohibit such scraping, or threaten repercussions for people who break the terms, especially for someone who republishes the data. Your example of a simple download dump doesn't seem to involve republication, and that seems to be the major issue with scrapers.

halfadot

>The hypothetical is irrelevant here; what is germane is the expectation of privacy of the individual participants, and the terms which bind people who use that service.

How can you have an expectation of privacy in a public forum? Where did this bizarre disorder originate, where people knowingly put their writing out there for literally anyone to read, then turn around and start talking about "expectations of privacy" when they realize what it entails?

AStonesThrow

Now those of us who've been around the block know that Discord is merely the latest iteration on chat servers such as IRC.

I'm interested to know, from anyone here who's an IRC operator or server/network admin, how the IRC community deals with scraping and bots, because in the early 90s, it was never an issue of corporate Terms of Service or legalese, but typically handled by community standards, and probably, people did whatever they could get away with, and this needed to be anticipated and tolerated by the other participants in any given server or channel.

I doubt that IRC users, back in the day or in the present, have any illusions of privacy, when logging or reflecting or bouncing chats is more or less a built-in feature and an integral component of such a networked chat service.

mvieira38

It's not. At least in the RPG scene, which I experience, it's almost fully replaced forums, and lots of great fanmade content and insightful discussion goes into that low discoverability cesspool which may go offline any day and scrapes all of your data

Stagnant

A big difference is that on Discord anybody who joins a server gets access to full history of chat logs whereas with IRC you don't get access to any past logs. So compared to IRC, Discord users should have an even lower expectation of privacy.

judge2020

But IRC bouncers have existed since forever - logging by someone in your channels was basically guaranteed outside of /privmsg.

chneu

Nobody should have any expectations of privacy on discord. It's all privately hosted, owned, etc. Why would anyone think it's private?

charcircuit

>Data was collected through Discord's public API, adhering to ethical guidelines

How is it ethical to break Discord's terms of service? An ethical researcher would respect any contracts that they agreed to and would not violate them to collect more data.

zetanor

Which ethical system demands that researchers from the DCC/UFMG not breach an unaffiliated commercial ToS during their research?

MarcelOlsz

edit: Whoops

__loam

Awesome analysis dude! I'm sure the judge will love that when discord sues these guys.

DaSHacka

He said _ethical_, not _legal_.

Would you agree abusive ToS's by massive corpos are unethical? What about the Disney+ ToS hiding a binding arbitration agreement preventing you from suing them? [0].

Or are you one of those "my personal ethics are whatever the law says" folk?

[0] https://www.nbcnews.com/news/us-news/disney-says-man-cant-su...

recursive4

...When you realize GPT-5 is going to be trained on your meme preferences...

SunlitCat

You mean, GPT-4 being so overenthusiastic with using emojis isn't peak AI chat? :D

encom

How to fix ChatGPT:

System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered — no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.

BizarroLand

Ran this as the context in a local qwen 14b model and it kept context for quite a while. Not bad

jowea

At least it will be able to help with winning at random obscure video games.

sneak

Now imagine the data mining that Discord can do on the complete DM history of every user. It’s not e2ee, remember.

judge2020

E2EE is definitely only possible in DMs (there's no chance for servers/guilds), but the cat is out of the bag in terms of user expectations on how DMs work.

So many users expect their entire decade+ history of DM contents, attachments included, to be available wherever they are and on any device, gated only by having their login/2fa or passkey. Switching to E2EE would be a major overhaul of that expectation, and it would be a huge task to train users to now keep their encryption key safe, backed up, and available across multiple devices.

Mostly unrelated, but they are absolutely going to have to cull old attachments eventually. There are attachments sitting in their GCP buckets that haven't been accessed since 2015. I'm sure their storage bill is at least a few million a month at this point, even if most is marked coldline.

sneak

e2ee works fine for Signal group chats; there is no reason it couldn’t be implemented on Discord group chats.

That’s not the issue. The issue is that Discord believes they deliver value through aggressively censoring their platform. e2ee prevents that.

e2ee also doesn’t prevent a user from storing their long term keys on the server to be retrieved on new devices and decrypted locally so they can access message history. e2ee does not require PFS.

SirMaster

The biggest problem that sucks about discord is that it isn't normally publicly searchable. And it seems to be a modern replacement for internet forums which historically were publicly searchable and often had a lot of great information about various hobbies and things.

drooopy

There is a special place in hell for software (including game) developers who exclusively use Discord to release patch notes, documentation, technical support, etc.

Kiro

Why? I follow a lot of solo game developers who only use Discord and I completely understand them not wanting to deal with multiple platforms. They should focus on the game.

chneu

Discord is horrible for this kind of stuff. There's a reason that GitHub and other sites of that type exist.

Discord is walled and hard to search. If a channel or server closes then all that information is lost.

Tons of data will be lost to discord when it goes down.

Idk if you've ever tried to use discord for mods or other software but it sucks. It's confusing. Information isn't cataloged well. Its search sucks. It just isn't good for this kind of thing.

nh23423fefe

they dont work for you?

__loam

[flagged]

DaSHacka

Not really, when devs use discord as a hub of documentation and discussion it inherently makes the information harder to access, especially searchability-wise.

You could argue "well you can just scrape it and post it online, like OP", but:

1. That's still an extra step and requires an account that could get banned doing it

And

2. Others (like yourself, even!) in this thread take issue with that approach.

So which is it? Just close down the information forever, yet accept no criticism about the fact you chose to host it on discord, knowing this would be the case?

mjr00

> The biggest problem that sucks about discord is that it isn't normally publicly searchable.

This is a feature of the platform, not a bug. Because of the lack of discoverability people act more genuine, for better or for worse, than public places like Twitter, Bsky, Facebook, Instagram, etc where you have to maintain your public image and/or act like HR is watching over your shoulder.

That being said, this feature also makes Discord inappropriate for things like release announcements, patch notes, etc. which should be publicly accessible.

jowea

You think so? I believe that's properly a consequence of culture, Discord being originally a gaming platform, and of pseudoanonymity, the same thing you have on reddit. Anyone who cares even a tiny bit can join any of those public servers and see what you posted. The big difference is that you don't have as many lurkers who just got there by googling and are going to leave immediately. By the 1-9-90 rule of thumb, that's a lot of people.

Macha

I think the YouTube real names and the nymwars era in general showed that requiring people to use their legal name doesn't actually change community standards.

Seattle3503

While being public plays a role in the kind of conversations that happen on those platforms, I think engagement hacking feeds play a larger role. Discord has none of that. It's sorted by time.

GaggiX

>than public places like Twitter

That seems to be a counterpoint to your argument. Users on Twitter usually do not hold back.

mjr00

The environment is drastically different from how it was 1-2 years ago, but the average white collar fortune 500 employee is still not going to post anything too controversial on Twitter under their real name and picture. If they are posting controversial things, which has certainly exploded post-Musk, they're making an effort to ensure they're not getting doxxed.

Contrast this to Discord which is more like old-school IRC, in that even when everyone is using an alias, if you talk to the same people day-in day-out, you know a fair bit about their personal lives, such as name and where they work.

vlovich123

> which historically were publicly searchable

Forums? No not generally unless you were a signed in user and often signups weren’t available to the general public just like here not all Discord rooms are automatically joinable. Digg, Reddit, slashdot were intentionally generally public forums that you could indeed search but they were the exception rather than the rule (in terms of count, not traffic). Indeed even Reddit has invite only forums that I believe aren’t searchable unless you are a member. Oh and searchable if you’re a member? That’s true for Discord.

Macha

You couldn't search with the forum software's built-in search feature unless signed in, sure, but they were usually indexed and searchable via Google. Indeed, many of them disabled their forum software's search feature and just directed you to Google's old Custom Search Engine feature (basically a search box with a hidden prefilled "site:" parameter), set up to save on server resources.
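
The prefilled "site:" trick amounts to prepending a `site:` operator to whatever the user typed and sending that to the search engine. A minimal Python sketch, assuming the standard Google search URL with a `q` parameter; `site_search_url` is a hypothetical helper and `forum.example.com` is a placeholder domain:

```python
from urllib.parse import urlencode


def site_search_url(site, query):
    """Build a Google search URL restricted to one site via the site: operator."""
    # The site: operator goes inside the query text itself.
    full_query = f"site:{site} {query}"
    return "https://www.google.com/search?" + urlencode({"q": full_query})
```

For example, `site_search_url("forum.example.com", "tukui install")` produces a link that searches only that forum's indexed pages, with no load on the forum's own server.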

lenova

I remember seeing a Show HN a few years back for an app/company that synced your Discord server with a publicly searchable, SEO-crawlable website... I wonder how they're doing:

https://www.linen.dev/

https://news.ycombinator.com/item?id=31494908

nan60

This is especially a problem for devs/artists that post updates exclusively over Discord. It's even worse if they don't do so in a separate channel and you have to dig through everyone chatting to find what you're looking for. This, along with the absence of threads (yes, Discord has threads, but who uses those?), makes searching for troubleshooting help awful. Thank god BBSes are still around.

mavamaarten

Agreed. Some use it as a knowledgebase and issue tracker and forum and chatroom in one. I absolutely despise it for that use case.

I mean I use it for voice chatting with friends while gaming too and it's fine for that.

But if I have to beg and plead to a discord bot to join a channel to just read some docs, I'm just going to ignore your project. Not sorry about that at all.

pteraspidomorph

Speaking as someone who has been running discord servers since 2015 - plus I maintain my own discord bot and am deeply familiar with the API - it's absolute garbage as an issue tracker. People really need to stop using it for that.

I think part of the problem is that they confuse the semantics of nomenclature. "Servers" are not really servers, "forums" are not really forums, and so on and so forth.

Macha

Yeah, I think they chose "servers" at the beginning because they were targeting the gamer VOIP crowd as a sort of Teamspeak competitor, and so they were trying to draw an analogy between a Discord group and your MMO guild's Teamspeak/Vent/Mumble server, but the terminology has stuck long after it made sense.

giancarlostoro

Pretty sure this violates Discord's Terms of Service. There was someone selling access to logs from servers: the person running the website joined them with self-bots (against the TOS) and just logged all available data. Discord definitely got legal on them. I wonder if this is even ethical, taking textual data from people without their knowledge. Not to mention, the number of minors on Discord alone gives me a lot of concern there too.

prmph

Why is it so hard to export your own messages out of Discord, Slack, etc?

We have regressed from the open email standard and gone back to these opaque islands of data that do not adhere to any standard.

Slack refused to show me my own messages past a certain age unless I paid up, and eventually deleted them.

01HNNWZ0MV43FF

It's hard because they want you to keep paying. Same reason AWS has free ingress and paid egress. The walled gardens are all built like carnivorous plants, the thorns face inwards.

roskelld

There are tricks to get messages from Slack, though I heard they were changing soon if not already.

A year or so ago I exported all messages from a Slack group I ran and used a Discord bot to recreate the entire dataset including channels and user posts. So we now have our entire history of messages without being blocked by a paywall (Until Discord does the same, and we'll be off to find a new home).
