Show HN: A free, privacy preserving, archive of public Discord servers

50 comments

·May 20, 2025

Hey HN!

I have been working on this project for a while, and I think this solves a problem that a lot of people here have: not being able to easily search Discord servers.

Currently, I only scrape servers that are marked as "discoverable" on Discord. However, if there's enough interest in the project, I'm open to adding specific servers by request. I'm primarily focused on informational servers rather than casual hangout spaces, such as open source projects, Minecraft mods, and support communities for tools, services, or platforms (for example, hosting providers).

I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.

This is my first large scale project, so I'd love to hear your feedback!

pabs3

searchcord

Hey,

This is interesting, I somehow missed this. Unfortunately, those are not full text searchable. Maybe I will download them and import them into Searchcord, with proper credit of course.

Thanks for this!

johnQdeveloper

> This is my first large scale project, so I'd love to hear your feedback!

> I have placed restrictions on searching directly by user ID to prevent doxing. I also made the opt out process one click, for those who do not want to be archived.

1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)

2) You do not seem to have clear rate limit documentation. If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.

3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.

searchcord

Hey,

Thanks for your suggestions.

> 1) I'd suggest anonymizing the usernames / author ids to something more privacy friendly such as how some image sites were generating 3-4 random words as a human readable unique id. This removes a lot of the reason people would opt out (i.e. posts being tracked down years later)

In the original iteration of Searchcord, it used to work similarly to that. The username was `sha256(userid+guildid)`, truncated to the first 8 characters. Unfortunately, it was pretty hard to follow chats. I will try your idea and see how it works, though.
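A minimal sketch of the two pseudonym schemes discussed here: the truncated `sha256(userid+guildid)` hash, and a word-based variant along the lines of the "3-4 random words" suggestion. The function names and the tiny `WORDS` list are assumptions made for illustration, not Searchcord's actual code.

```python
import hashlib

# Hypothetical word list; a real deployment would need a much larger one
# to keep collisions between users rare.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "heron",
         "iris", "jade", "kelp", "lotus", "maple", "nova", "opal", "pine"]


def hash_pseudonym(user_id: str, guild_id: str, length: int = 8) -> str:
    """sha256(userid+guildid), truncated to the first `length` hex characters."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).hexdigest()
    return digest[:length]


def word_pseudonym(user_id: str, guild_id: str, count: int = 3) -> str:
    """Map the first `count` digest bytes to words for a readable, stable name."""
    digest = hashlib.sha256((user_id + guild_id).encode("utf-8")).digest()
    return "-".join(WORDS[b % len(WORDS)] for b in digest[:count])


if __name__ == "__main__":
    # Same user and guild always yield the same name, so a chat stays followable.
    print(hash_pseudonym("1234", "5678"))
    print(word_pseudonym("1234", "5678"))
```

Because the guild ID is part of the hash input, a user keeps one pseudonym within a server but generally gets a different one in each server.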

> 2) You do not seem to have clear rate limit documentation.

This is a good idea. The rate limit varies by endpoint, and I haven't gotten around to documenting each one.

> If you are asking people to pay for commercial use, I'd suggest making it clear what the rough original limits are as well as the rough price range of what you'd offer.

I have absolutely zero idea what industry would be interested in this, in what form, and if anyone would even pay.

> 3) Tbh, the only real thing I want from this project is basically narrative / roleplay / writing content for LLM reasons as I'm trying to build a rules-oriented system that narrates via LLM. If you don't want people using this data for this purpose, I'd suggest making that clear.

I really don't care what people do with the data, as long as they are not spamming requests or using the data for commercial purposes without permission.

sReinwald

The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

Look, I get the frustration that (likely) motivated this. Discord has become an information black hole for many communities, and the shift away from open, searchable forums for project support is a genuine problem I've been incredibly frustrated with myself. But this "solution" - creating a massive, non-consensual archive that tramples over user privacy (and platform terms) - creates far graver ethical and practical issues than the one it purports to solve.

searchcord

> The sheer audacity here is quite something. You're stating people can't use your scraped data for commercial purposes "without permission," while your entire project is built on vacuuming up content from countless users without their permission, and in direct violation of Discord's ToS. That's not just a double standard; it's bordering on next-level cognitive dissonance.

Not really, it is not free to host and serve this data. If they want to get the data for free, they can get it directly from Discord. I did that work for them.

> And "privacy preserving"? With a one-click opt-out, that 99.999% of the affected users will never even know exists because they have no idea their conversations are now part of your archive, and you want it indexed by search engines? That's not "privacy preserving" - that's a bad joke. If privacy was a genuine concern, this project wouldn't exist in its current form. What you're offering is an opt-out fig leaf for a mass data harvesting operation.

Again, not really. It's impossible to search for users without already knowing what server they are in. This is functionally identical to Discord's in-built search feature.

> Most people using Discord, even on "public, discoverable" servers, aren't posting with the expectation that their words will be systematically scraped, archived indefinitely, and made globally searchable outside the platform's context. It's a fundamental misunderstanding (or willful dismissal) of user expectations on what is essentially a semi-public, yet distinctly siloed, platform. This isn't an open-web forum where content is implicitly intended for broad public consumption and indexing.

I believe that people need to realize that their messages were already being logged by many different moderation bots, just not publicized. This also happens on platforms like Telegram, look at the SangMata_BOT for example. Unless the messages are end to end encrypted, it was just a matter of time before they were scooped up and archived.

Thanks for your input, though, I really do want to build a platform that balances privacy and usability.

deakam

[dead]

Abishek_Muthian

Congratulations and all the best.

Could the Searchcord API be useful for Discord servers that want to archive their chats to their own website?

For example, I have a Discord server for my product and I want to copy the Q&A threads to the FAQ section of my product website. Would Searchcord be useful for that, or are there better solutions?

belst

This clearly breaks discord TOS.

I did consent for discord to have my data, I did NOT consent to you having my data.

The Discord TOS clearly states:

> Our services might also provide you with access to other people’s content. You may not use this content without that person’s consent, or as allowed by law.

As I was not informed of the usage BEFORE it was taken, I could neither opt in nor opt out.

GDPR clearly states, even in the case of "legitimate interest" I have to be informed.

I only found this randomly; if I hadn't, I would have had no idea of the data validation happening, so I couldn't have opted out.

Technically, cool project. Legally, not so cool.

treyd

Would you consider making regular dumps of the database available in sharded torrents like Anna's Archive does so that users can back up the data themselves for preservation purposes? This would complicate retroactively removing users' activity, but that data could already be scraped.

And related, I'd like to be able to run this locally for exports of guilds that I'm on myself. Is that even possible with the architecture you've built?

searchcord

Hey,

This is absolutely something I want to do, but at the guild level. The database itself is over 13TB, which is much too large to create regular exports of. I will probably provide a SQLite export of each guild, regenerated each week/month. Anyone is free to download whatever they want in real time from the API.
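A rough sketch of what a per-guild SQLite export could look like, using only Python's standard `sqlite3` module. The `messages` table, its columns, and the `export_guild` function are all assumptions for illustration; the real schema has not been published.

```python
import sqlite3
from typing import Iterable


def export_guild(db_path: str, guild_id: str, messages: Iterable[dict]) -> int:
    """Write one guild's messages to a SQLite file and return the row count.

    Each message dict is assumed (hypothetically) to carry the keys
    message_id, channel_id, author, content and created_at.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS messages ("
                "message_id TEXT PRIMARY KEY, guild_id TEXT NOT NULL, "
                "channel_id TEXT NOT NULL, author TEXT NOT NULL, "
                "content TEXT NOT NULL, created_at TEXT NOT NULL)"
            )
            # INSERT OR REPLACE keeps a regenerated export idempotent.
            conn.executemany(
                "INSERT OR REPLACE INTO messages VALUES "
                "(:message_id, :guild_id, :channel_id, :author, :content, :created_at)",
                ({**m, "guild_id": guild_id} for m in messages),
            )
        return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    finally:
        conn.close()
```

One file per guild keeps each export small enough to regenerate on a schedule, and a single self-contained file is easy to download or seed.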

Thanks for your question!

squigz

Big +1 for dumps

You might try reaching out to Anna's Archive and see if this would be a dataset they'd be interested in helping host/distribute. I think they'd agree that such data is important and should be archived.

hofrogs

This is really cool and actually useful for peeking behind those annoying login walls. What software do you use to store, index, and search so much data? How did you get the data in the first place? Discord isn't exactly known for making its data easily available. Have the administrators of the guilds asked you for this? Have you contacted them and made them aware after the fact?

searchcord

Hey,

Thanks for your feedback.

For software, I use ScyllaDB and Elasticsearch. It's split across 6 physical nodes (8 including the CDN). Data collection is handled using standard user accounts, accessing only public, discoverable servers. I plan to write a blog post about the technical aspect of how this was done soon.

Admins of these servers weren't contacted, as the content indexed is already publicly accessible, comparable to a forum like this or public subreddit. That said, I understand the sensitivity around data visibility, and I've made it very simple for any user to opt out of indexing at any time. Private or invite-only servers are, of course, completely excluded.

klntsky

I suggest you remove the opt-out functionality and let it scrape private servers that it discovers via publicly posted invite links. You don't owe anyone posting on a public forum any privacy. Moreover, the most valuable data to search for is probably somewhat obscured.

searchcord

Hey,

Thanks for your suggestions. However, this does not work for a few reasons:

1. Joining servers is protected by increasingly difficult to solve captchas that have no commercially available solver. This is not a battle I want to fight.

2. There are a LOT of CSAM rings that spam invite links in public servers. This is also not something I want to go anywhere near.

Moreover, after the fallout of spy.pet, I think it is very important that users are able to opt out.

hofrogs

That's a lot of compute, how much does it cost to keep it running? I don't see how that project would generate any income on its own

searchcord

I already own the hardware, so I only pay for colocation and transit. It's probably a lot less than you think. I hope to find some way to monetize it, but it is cheap enough that I can keep it running for quite a long time without any income.

IceWreck

Do you plan to handle servers where you need to do some action (like send a message) to join all channels?

I was scrolling through the home page and came across a few where the only channels you're allowed to access are the verify-yourself or welcome channels.

searchcord

Probably not. Discord will aggressively captcha you and every server has a different implementation of verification. It might be possible with a captcha solver and then some LLM to figure out the next steps.

Stagnant

Incredible work! Truly eye-opening to see how some rarer keywords in my native language return pages of relevant results. Meanwhile, Google gives 0 results or just AI/ad spam.

jonasdoesthings

Maybe also exclude messages by bots (e.g. "username has joined the server") from the index to decrease the stalking potential of your site (99.9% of these bot messages have no informational value for the index anyway). Currently you can still search for a username and get a subset of servers that the username is in (even if not active) by finding these bot messages.

searchcord

This is something I have looked into. Unfortunately, every server uses a different format/bot. It might be possible to develop some sort of machine learning classifier to determine if it is a welcome message.
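As a simpler stand-in for the machine learning classifier mentioned above, a regex heuristic can catch the most common join announcements. This is only a sketch: the patterns, the `author_is_bot` flag, and the function name are assumptions, and since every server uses a different format, a fixed pattern list will miss many cases.

```python
import re

# Hypothetical patterns for common join/welcome announcements.
JOIN_PATTERNS = [
    re.compile(r"\bhas joined the server\b", re.IGNORECASE),
    re.compile(r"\bjust (joined|landed|showed up)\b", re.IGNORECASE),
    # "Welcome ..." followed by a user mention such as <@123> or <@!123>.
    re.compile(r"^welcome\b.*<@!?\d+>", re.IGNORECASE),
]


def is_join_message(content: str, author_is_bot: bool) -> bool:
    """Return True if a bot-authored message looks like a join announcement."""
    if not author_is_bot:
        # Only bot output is filtered; human messages stay in the index.
        return False
    return any(p.search(content) for p in JOIN_PATTERNS)


if __name__ == "__main__":
    print(is_join_message("alice has joined the server", author_is_bot=True))
    print(is_join_message("How do I install the mod?", author_is_bot=True))
```

Messages flagged this way could be skipped at indexing time, which would close the username-lookup path described in the parent comment for the formats the patterns cover.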

3abiton

This is an amazing project. I always wonder how much information is lost in those chat apps, not only Discord but also Telegram. The latter has a huge dev community, specifically around Android ROM development, which migrated from the forum-based XDA to more flexible chat/support platforms like Telegram. I wish that could also be searchable without needing their client.

searchcord

Telegram is already heavily monitored and scraped due to the large volume of illegal or extremely controversial activity that happens there. This is something I will look into though, my XDA threads rarely get any replies anymore. Thanks for the suggestion!

legionof7

I've been looking for something like this for so long, thanks for making!

There's so much stuff locked in Discord now that forums have fallen in popularity, think this sort of thing really helps unlock that knowledge again.

searchcord

Thanks for your feedback! <3