
Show HN: A blocklist to remove spam and bad websites from search results

87 comments

·January 14, 2025

Hi HN!

I've been fed up with search results so much that I decided to make a giant blocklist to remove garbage links by using uBlacklist.

I browsed other blocklists and wasn't very satisfied with what exists now; the goal of this one is to be super organized and transparent, explaining why each site was blocked via issues. Contributions welcome!

Even though around 100 domains are blocked so far, I already noticed a big improvement in casual searches. You'd be surprised how some AI generated websites can dominate the #1 page on DuckDuckGo.

cormorant

I'm fed up too. Spammy, AI-looking sites are showing up more and more. For some reason, many of them use the same WordPress theme with a light gray table of contents - they look like this: https://imgur.com/a/totally-not-ai-generated-efsumgZ

The problem seems worse on "alternative" search engines, e.g. DuckDuckGo and Kagi, which both use Bing. It's been driving me back to Google.

A blocklist seems like a losing proposition, unless, like adblock filter lists, it balloons to tens of thousands of entries and gets updated constantly.

Unfortunately, this kind of blocklist is highly subjective. This list blocks MSN.com! That's hardly what I would have chosen.

popcar2

Even Google is plagued by spam. I've tried all sorts of search techniques and alternative engines, but I feel like the only solution is doing things manually. I was already starting to block things by myself, but I thought it'd be more productive to make the list public and try crowdsourcing. Even now, searching "how to partition a hard disk" will often send you to low-effort sites telling you to use their software.

> Unfortunately, this kind of blocklist is highly subjective. This list blocks MSN.com! That's hardly what I would have chosen.

It's definitely a bit opinionated, but it's open to discussion - you can create an unblock request issue (if you care enough to do so, of course!). The reason I blocked MSN is that it just re-hosts articles from other websites, so I'd rather see the official source than be tricked into visiting Microsoft's site, which is very annoying: for example, it opens another article if you scroll down too fast.

maximilianthe1

Recently learned a little trick for Google. Adding `-ai` at the end of the query helps. Not much, but something.

radicality

Afaik DDG is just Bing, whereas Kagi is using Google, Bing, (Yandex?) among others - https://help.kagi.com/kagi/search-details/search-sources.htm...

As a Kagi user I actually haven’t encountered much search result spam, surprised you’re seeing enough there to drive you back to Google!

rendaw

I get tons when looking up recipes and cooking related information. Things that will say "X can be refrigerated for up to two weeks" then in the next paragraph "X is fine to refrigerate and eat for 2-3 days" or similar.

I'd block them but there seem to be infinitely many. They're probably buying 10+ character domains using random words/names/phrases in bulk.

econ

I was just thinking... Depending on the type of articles one can pretty decently describe what makes it a good one. Recipes should be short texts that may link to a gallery, a video and to a text about it. They should have a section called ingredients and one for preparation and may have an author and a date. Research articles should cite sources elaborately.

BigGreenJorts

> Unfortunately, this kind of blocklist is highly subjective. This list blocks MSN.com! That's hardly what I would have chosen.

I'm wondering how much the blacklist can be broken down into categories of spam. Sponsorblock for YouTube has a lot of options around the types of things it'll skip through, and the user has a choice in how they're handled (skipped automatically, prompted to skip, simply highlighted in the scrubbar) at the category level.

nosioptar

You can use ublacklist without a list and just block shit sites as you see them.

I'm loving being able to search for something without getting results from garbage sites like howtogeek, stackoverflow, MSN, Pinterest, etc.

Llamamoe

Since when is stack overflow a garbage site?

Ringz

Installed! This should not be a function of the search engine nor a plugin. This should be integrated in the browser.

Another great function (not for this plugin) should be the option to "bundle" all search results from the same domain. Stuff them under one collapsible entry. I hate going through lists and pages of apple/google/synology/sonos/crab urls when I already know that I have to search somewhere else.
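The bundling idea could be sketched roughly like this in Python. It is only an illustration of grouping result URLs by hostname, not a feature of uBlacklist or of any search engine; the function name `group_by_domain` and the sample URLs are made up for the example.

```python
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Bundle result URLs by hostname, keeping first-seen order."""
    groups = {}
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # Assumption: "www.example.com" and "example.com" share a bucket.
        if host.startswith("www."):
            host = host[4:]
        groups.setdefault(host, []).append(url)
    return groups


# Hypothetical result list, in the order a search page might return it.
results = [
    "https://support.example.com/article-a",
    "https://www.example.org/post",
    "https://support.example.com/article-b",
]
bundled = group_by_domain(results)
for host, links in bundled.items():
    print(f"{host} ({len(links)} results)")
```

Each key in `bundled` would then become one collapsible entry, with its list of links shown only when expanded.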

troyvit

You could get a step closer to that and integrate it into your DNS: https://github.com/StevenBlack/hosts

The upside is that it would go beyond your browser to anything on your machine that makes a DNS request.
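As a rough sketch, a plain list of domains can be turned into hosts-file entries like this. The `0.0.0.0 domain` line format is the one hosts files such as the linked project use; the function name and sample domains are hypothetical. Note that a hosts entry only matches the exact hostname, so subdomains need their own lines.

```python
def to_hosts_lines(domains, sink="0.0.0.0"):
    """Turn a list of domains into hosts-file lines, skipping blanks,
    comments, and duplicates."""
    lines = []
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain.startswith("#") or domain in seen:
            continue
        seen.add(domain)
        # Pointing the name at 0.0.0.0 makes lookups resolve to a dead end.
        lines.append(f"{sink} {domain}")
    return lines


# Hypothetical input; a real list would be read from a file.
sample = ["Spam.example", "", "# a comment", "spam.example", "junk.example"]
hosts_lines = to_hosts_lines(sample)
print("\n".join(hosts_lines))
```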

> Another great function (not for this plugin) should be the option to "bundle" all search results from the same domain. Stuff them under one collapsible entry.

That would be really cool. Just zip it up if you don't want to see that domain for that specific search.

LeoPanthera

It's not going to be long before we need to move to a whitelist model, rather than a blacklist model.

It ironically makes me think of the Yahoo Web Directory in the 90s.

Time is a flat circle.

manx

This. I think well-curated web directories (by humans and machines) deserve a comeback.

dredmorbius

Yes and no.

Power-law relations mean that a small number of domains will account for the lion's share of low-relevance results, and filtering those out will result in dramatic improvements in relevance.

That small set is probably fairly dynamic, however, and will likely change at a fairly high rate over time.

Penny-ante sites are less likely to appear in generic results, but might well be whatever the spam/phish term is for junk general Web search results.

We may well come to rely more on whitelisting, but I think at least for now that's not necessary, largely due to the dynamics of publishing / attention economies themselves.

antithesis-nl

So, if you already run uBlock Origin (and of course you are), you can use this list without installing any additional extensions by going to 'Filter lists' in the uBlock settings, then Import, then enter https://raw.githubusercontent.com/popcar2/BadWebsiteBlocklis... as the URL.

Not saying you should, just that you could...

popcar2

I think this would block you from visiting the websites, but they'd still show up on search results. uBlacklist doesn't block them, but rather just hides them for search engines which IMO is a better approach.
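To illustrate the difference, hiding means filtering the result list itself rather than blocking the request after a click. Here is a minimal Python sketch of that idea; it is not uBlacklist's actual code, and the function names and domains are placeholders.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def hide_blocked(result_urls, blocked_domains):
    """Drop blocked entries from a list of search-result URLs."""
    return [u for u in result_urls if not is_blocked(u, blocked_domains)]


# Hypothetical blocklist and result page.
blocked = {"spam.example"}
page = [
    "https://spam.example/how-to-fix-it",
    "https://blog.spam.example/top-10",
    "https://docs.example.org/guide",
    "https://notspam.example/ok",
]
visible = hide_blocked(page, blocked)
print(visible)
```

Matching on `"." + d` rather than a bare suffix keeps `notspam.example` from being caught by a rule for `spam.example`.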

antithesis-nl

Yeah, I just tested this, and you're right. Going to google.com and entering solveyourtech as a search term, did indeed still return their site as a result.

On clicking it, uBlock blocked my visit, but that may or may not be enough for you, in which case an additional plugin may be warranted.

gtfiorentino

Hi @popcar2 — how are you sourcing the domains for the blocklist? We'd like to evaluate those domains and consider whether they should be removed from DuckDuckGo as spam. You can also report a site directly in the search results by clicking the three-dot menu next to the link and selecting "Share Feedback about this Site".

popcar2

Hi! I'm mostly going through them manually. Not all of the domains in the list are literally spam - most of the list also includes misleading sites like corporate blogs that trick people into downloading their software.

You might be interested in the AI spam/low effort section though, one that tops DDG often are these AI generated tech articles: https://github.com/popcar2/BadWebsiteBlocklist/issues/1

They're the same site under different domains, you can tell it's AI by its writing style, how much they churn out per day, how little info there is about who's writing it, how similarly the about pages are written, and how the same article is suspiciously also in similar-sounding sites.

Another one I just caught today that was on top of page 1: https://github.com/popcar2/BadWebsiteBlocklist/issues/84

I'll be sure to report these sites as I'm adding to the list, thanks.

gtfiorentino

Got it, that's good to know, thanks! I've added the domains in the AI spam / low effort section to our list of user reports for review.

nashashmi

I feel like lots of sites that are censored by Bing and Google are also being censored by DDG. Is there really any point in filtering out more sites? Search results are so meaningless now. All results simply point to the same 200 links, page after page.

james-bcn

With the Kagi search engine there is a way in the settings to bulk-upload lists of domains to block (or upvote) them. Has anyone uploaded a list like this to it?

I may do that.

freedomben

That was my thought as well. Their UI is great for one-at-a-time operations, but an API endpoint I could curl and sync with a local file I keep in git would be killer.

Although, using this via the extension would make it cross-platform so the block affects kagi and google, which could be nice.

Although, that would require manual syncing between devices, which would not be nice.

Although, uploading it to kagi through API doesn't mean I have to not use the extension, so having the cake and eating it too may be possible.

thoughtpalette

Was thinking of that as I was browsing this doc! I just did the ole' reddit.com -> old.reddit.com redirect via kagi yesterday.

shortformblog

The problem with a list like this is that a “bad website” is in the eye of the beholder. I’m not saying that there’s anything wrong with you personally not liking the Shopify or the Semrush blog. But I think that everyone else has their own calculus.

It’s the same reason why social media blocklists can be problematic—everyone’s calculus is different.

My suggestion is that you promote it as a starter and suggest that users fork it for their own needs.

swayvil

Some kind of democratic process. Where membership and blacklist are both something arrived at democratically.

It could be simple.

Good?

shortformblog

Seems like a kit that can be personalized across broad categories might be a better bet. By putting the onus on one list you don’t solve the main problem, which is that the list might block things you’re fine with.

manx

Some kind of community notes consensus system could make sense here to find common ground. Only when a diverse set of people agree that a site belongs on the list is it added.
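A very simplified sketch of that rule in Python: a domain is added only once votes for it come from enough distinct groups of people. This is not how any real community notes system works; the grouping, the threshold, and all names here are assumptions for illustration.

```python
from collections import defaultdict


def consensus_blocklist(votes, min_groups=2):
    """Return domains that were voted for by at least `min_groups`
    distinct voter groups. `votes` is an iterable of (domain, group)."""
    groups_by_domain = defaultdict(set)
    for domain, group in votes:
        groups_by_domain[domain].add(group)
    return sorted(
        domain
        for domain, groups in groups_by_domain.items()
        if len(groups) >= min_groups
    )


# Hypothetical votes: (domain, group the voter belongs to).
votes = [
    ("spam.example", "developers"),
    ("spam.example", "home-cooks"),
    ("corpblog.example", "developers"),
    ("corpblog.example", "developers"),
]
agreed = consensus_blocklist(votes)
print(agreed)
```

Here `corpblog.example` stays off the list because both of its votes come from the same group, however many there are.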

edm0nd

I recently started a crypto scam/phishing blocklist if you wanna roll these into your list as well.

also works well with Pi-hole and other platforms.

https://github.com/spmedia/Crypto-Scam-and-Crypto-Phishing-T...

the_snooze

This is one of those features a proper search engine (i.e., not a thinly-veiled advertising network) should have. If users can customize their search results and share their sorting/filtering methods, then that presents a large number of constantly-moving targets that greatly drives up the cost of SEO. There's no "making the Google algorithm happy." Instead, it becomes more "making the users happy."

bityard

Google used to do this years ago but clawed it back around the time they started removing _all_ customizations under the premise of, "we know how to customize your results better than you do."

DuckDuckGo has site blocking. The problem is that there are so many SEO-optimized blogspam, referral link, and other "garbage" sites that you could spend a lifetime blocking each one individually before you get any actual work done. And it's only getting worse now that LLMs can generate a whole web site for you in a matter of minutes. I imagine a dedicated individual could provision several thousand websites/blogs per day, just chock full of ads and referral links.

Kuinox

I don't understand why so many corporate blogs are blocked. Most of them are about their product, or about the industry in general.

- For example, kaspersky blog doesn't look bad.

- CCleaner blog is just a list of updates.

popcar2

They aren't, that's why they're blocked. You can see more detail in the issues but the blocked corporate blogs often make clickbait to advertise their product, like https://www.ccleaner.com/knowledge/windows-11-problems-how-t...

owenthejumper

I think that's misleading on its own and that makes this list useless. From that logic you should block every single corporate blog out there.

This looks like someone's personal list not a serious effort.

popcar2

If you'd like to search for how to fix a technical problem and want the first page of results to be articles saying "download/buy our product to fix it now!", then the list probably isn't for you.

There are quite a few company blogs I haven't blocked, mainly ones that are actually informative and aren't trying to trick you into looking at their products.

Llamamoe

> From that logic you should block every single corporate blog out there.

I considered this previously. I feel like the web would be a vastly improved experience if you just blocked everything affiliated with a corporation as opposed to a university, nonprofit, or a personal site.

Timwi

> you should block every single corporate blog out there.

That is indeed something I'd want.

> This looks like someone's personal list not a serious effort.

It is the OP’s personal list and they were completely open about that.

HackerThemAll

> This looks like someone's personal list not a serious effort.

This has just started. Instead of whining, contribute. The more people contribute, the more of a "serious effort" it will become.

Kuinox

And Kaspersky?

jwx48

Because corporate blogs are predominantly nothing but marketing fluff that dominates search results so thoroughly that they drown out any actual useful information.

MortyWaves

Who’s going to be the first to make the PR for Medium and “dev.to”?

CamperBob2

Why Medium?

bluetidepro

Likely because of the annoying paywall on most of Medium.

nayuki

Related: Freya Holmér - "Generative AI is a Parasitic Cancer" https://www.youtube.com/watch?v=-opBifFfsMY (1h19m54s) [2025-01-02].

She talks at length about how pages of AI-generated nonsense text are cluttering search results on Google and all other search engines.

huesatbri

She really put my thoughts into words regarding this. The “who is this for” part really hit home.