Scraperr – A Self Hosted Webscraper
56 comments
· May 11, 2025
lucb1e
edoceo
What do you have for log analytics and ban automation? Could you say more about how to identify these bad-bots?
lucb1e
There is no automation, I use `tail -f access.log`
I just look at what's happening on my server every now and then. Sometimes not for months, but since I set up a project like that caching proxy, I've been keeping a more regular eye on it to see that crawlers aren't bothering the upstream via me. Most respect the robots policy, and most of the ones that don't still set a user agent string that includes the word 'bot', so I know not to refresh the cache based on that request. So far it has mostly been Huawei who pretend to be a regular user but request millions of pages (from 12 separate IP ranges so far, some of them bigger than /16, some of them a handful of /24s).
> Could you say more about how to identify these bad-bots?
Many requests per day to random pages from either the same IP address (range), or ranges owned by the same corporation
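A minimal sketch of that manual check, assuming an access log whose first field is the client IP (as in the common and combined log formats). Grouping by /24 is an assumption for illustration, since the ranges described above vary in size; the function name and sample lines are hypothetical.

```python
import ipaddress
from collections import Counter


def requests_per_range(log_lines, prefix_len=24):
    """Count requests per IPv4 network of the given prefix length."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        if addr.version != 4:
            continue  # this sketch only groups IPv4 addresses
        # strict=False lets us pass a host address and get its network back
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Made-up log lines using documentation-only address ranges
sample = [
    '203.0.113.7 - - [11/May/2025:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.99 - - [11/May/2025:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.4 - - [11/May/2025:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
]
counts = requests_per_range(sample)
for network, hits in counts.most_common():
    print(network, hits)
```

Ranges owned by the same corporation would still need a separate lookup (for example via whois), which this sketch does not attempt.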
VladVladikoff
What sort of pages require 20 seconds to generate? This is extremely slow by most web standards and even your users would be frustrated by this. It sounds like poorly designed database queries with unindexed joins.
Google will also abandon page loads that take too long, and will demote rankings for that page (or the entire site!)
beatthatflight
So what about flight searches where we have to query several 3rd party providers, which can take 45 seconds to get results from all of them (out of my control)? I can dynamically update the page (and do), but a scraper would have to wait 20-45 seconds to get the 'cheapest' flight from my site. I can make the queries async and have the fastest pipes, but if the upstream providers take their time (they need to query their GDSs as well), there's not much you can do.
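To illustrate why async queries don't remove the wait, here is a rough sketch with simulated providers. The provider names, delays, and prices are made up, and `asyncio.sleep` stands in for the real third-party calls.

```python
import asyncio


async def query_provider(name, delay, price):
    """Stand-in for a slow third-party search; sleeps instead of calling an API."""
    await asyncio.sleep(delay)
    return name, price


async def cheapest_so_far(providers):
    """Return the running cheapest result after each provider responds."""
    tasks = [asyncio.create_task(query_provider(*p)) for p in providers]
    best = None
    updates = []
    # as_completed hands results back in the order the providers finish
    for finished in asyncio.as_completed(tasks):
        name, price = await finished
        if best is None or price < best[1]:
            best = (name, price)
        updates.append(best)
    return updates


# Hypothetical providers: (name, response delay in seconds, price)
providers = [("slow", 0.15, 199), ("fast", 0.05, 249), ("medium", 0.10, 219)]
updates = asyncio.run(cheapest_so_far(providers))
print(updates)
```

The page can show each intermediate result as it arrives, but the true cheapest price is only known once the slowest provider has answered.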
lucb1e
> It sounds like poorly designed database queries with unindexed joins
Neither of those assumptions is correct. As an example, one page needs to look through 2.5 million records to find where the world record holder changed because it provides stats on who held the most records, held them for the greatest cumulative time, etc. The only thing to do would be introducing caching layers for parts of the computation, but for the number of users this system has, it's just not worth spending more development time than I already have. Also keep in mind it's a free web service and I don't run ads or anything, it's just a fan project for a game
> Google will ... demote rankings for that page (or the entire site!)
Google employs anticompetitive practices to maintain the search monopoly. We need more diversity in search engines, I don't know how else to encourage people to use something instead of, or at least in addition to, Google, besides by making Google Search just not competitive anymore. Google's crawler cannot access my site in the first place (but their other crawlers can; I'm pretty selective about this). My sites never show up in Google searches, on purpose
It's also not the whole site that's slow, it's when you click on a handful of specific pages. If that makes those pages not appear in search results, that's fine. Besides that it's not my loss, it's not like any other site has the info so people will find their way to the main page and click on what they want to see
VladVladikoff
Like I said then, you need indexes on the columns you filter on in this table. Searching a table of 2.5 million records for a value is still blazing fast if you use indexes correctly. I’m talking about 0.01 seconds or less, even with much larger tables.
I agree about Google being shit. However, my website makes my living, and feeds and clothes my children, so I have to play along to their rules, or suffer.
Please take your slowest performing query and run it with EXPLAIN in front. And share that (or dump it into an LLM and it will tell you how to fix it)
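As a rough illustration of what that check looks like, here is a sketch using SQLite's `EXPLAIN QUERY PLAN` as a stand-in (the thread doesn't say which database is involved, and other engines format their EXPLAIN output differently). The table, columns, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, track TEXT, holder TEXT, set_at INTEGER)"
)
conn.executemany(
    "INSERT INTO records (track, holder, set_at) VALUES (?, ?, ?)",
    [(f"track{i % 50}", f"player{i % 7}", i) for i in range(1000)],
)

query = "SELECT holder, set_at FROM records WHERE track = ?"


def plan(sql, params):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


before = plan(query, ("track3",))  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_records_track ON records (track)")
after = plan(query, ("track3",))   # the plan should now mention the index
print(before)
print(after)
```

An index helps a selective lookup like this one. It does not by itself settle whether a computation that has to visit every row can be made fast, which is the point disputed in the replies.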
selcuka
> It sounds like poorly designed database queries with unindexed joins.
I find it amusing that you think every database operation imaginable can be performed in less than 20 seconds if we throw in a few indexes. Some things are slow no matter how much you optimise them.
The GP could have implemented them as async endpoints, or callbacks, but obviously they've already considered those options.
throwup238
It's the kind of prescriptive cargo culting that is responsible for a significant fraction of pain involved in software engineering, right up there with DRY and KISS and shitty management.
I bet the GP abstracts out a function the second there's a third callsite too, regardless of where it's used or how it will evolve - only to add an options argument and blow up the cyclomatic complexity three days later.
smartmic
My preferred "self-hosted" webscraper is a local, single binary called xidel [1]. The feature I really like is that it can also follow links.
andrethegiant
Shameless plug: prefix any URL with https://pure.md/ to get the pure markdown of that page. Useful for direct piping into an LLM. Has bot detection avoidance, proxy rotation, and headless JS rendering built in.
matt-p
That's excellent pricing from a structural perspective.
renegat0x0
Not a web scraper, but web crawler software. It lets you specify the crawling method (Selenium and others). Returns data in JSON (status code, text contents, etc).
nomilk
I used to scrape back in the day when it was easy (literally just make a request and parse html). Seems cloudflare checkboxes / human verification are very commonplace nowadays. Curious how(/if) web scrapers get around those?
welanes
1. Clicking the box programmatically – possible but inconsistent
2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better
3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best
I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.
[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.
nomilk
That's awesome. Thanks for sharing.
First time hearing of the fetch() approach! If I understand correctly, regular browser automation might typically involve making separate GET requests for each page. Whereas the fetch() strategy involves making a GET for the first page (just as with regular browser automation), then after satisfying cloudflare, rather than going on to the next GET request, use fetch(<url>) to retrieve the rest of the pages you're after.
This approach is less noisy, has less impact on the server, and is therefore less likely to get noticed by bot detection.
This is fascinating stuff. (I'd previously used very little javascript in scrapes, preferring ruby, R, or python but this may tilt my tooling preferences toward using more js)
Tokumei-no-hito
first time hearing about fetch too. but i don't see the advantage. is fetch reusing the connection and a manual page load not?
therein
Almost. I mean it's not like fetch(..) is going to lead to some esoteric kind of HTTP request method. I am guessing parent comment is saying what it is saying because fetch will utilize the cookies and other crumbs set by the successful completion of the captcha. If you can take all those crumbs and include it in your next GET request, you don't need to resort to utilizing fetch.
gruez
>Seems cloudflare checkboxes / human verification are very commonplace nowadays. Curious how(/if) web scrapers get around those?
You can get a real browser[1] to check the box for you, then use the cookies in your "dumb" scraper.
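A minimal sketch of the cookie hand-off, assuming the browser's cookies have been exported as dicts with `name`, `value`, and `domain` keys (a shape many automation tools use). The cookie names, values, and URLs below are placeholders, and nothing is sent over the network.

```python
import urllib.request


def cookie_header(cookies, host):
    """Build a Cookie header from browser-exported cookies that apply to host."""
    pairs = []
    for c in cookies:
        domain = c["domain"].lstrip(".")
        # Keep cookies set for this host or for one of its parent domains
        if host == domain or host.endswith("." + domain):
            pairs.append(f'{c["name"]}={c["value"]}')
    return "; ".join(pairs)


# Hypothetical cookies captured after the real browser passed the check
exported = [
    {"name": "clearance", "value": "abc123", "domain": ".example.com"},
    {"name": "session", "value": "xyz", "domain": "www.example.com"},
    {"name": "other", "value": "1", "domain": "unrelated.test"},
]

request = urllib.request.Request(
    "https://www.example.com/page/2",
    headers={
        "Cookie": cookie_header(exported, "www.example.com"),
        # Clearance cookies are often tied to the browser's user agent, so the
        # follow-up request should send the same string (placeholder here).
        "User-Agent": "Mozilla/5.0 (placeholder; copy the real browser's value)",
    },
)
print(request.get_header("Cookie"))
```

This is the same idea as the fetch() approach discussed above: what matters is carrying over the cookies set when the check was passed, not which client makes the follow-up request.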
nicman23
i usually use a real browser that i use, profile and all
anxman
By clicking the box
TheTaytay
Does anyone know of a scraper that uses LLMs/natural language to build a deterministic, robust script that I can use to scrape the same site in the future? All of the natural language extractors I’ve seen so far need an LLM every time, but that seems unnecessary…
throwup238
llm-scraper [1] does a decent job but it's still a bit fragile. The biggest problem I have is all the React CSS-in-JS libraries that use hashes in their class names, which the LLM isn't smart enough to ignore.
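One possible workaround, sketched under heavy assumptions: strip class names that look generated before handing the markup to the LLM or building selectors. The two patterns below are guesses at prefixes commonly seen from CSS-in-JS libraries, not a complete or authoritative list, and this is not part of llm-scraper.

```python
import re

# Heuristic patterns for generated class names. These are assumptions: they
# will miss some libraries and can occasionally match hand-written names.
GENERATED_CLASS_PATTERNS = [
    re.compile(r"^css-[a-z0-9]{4,}$"),    # e.g. css-1q2w3e
    re.compile(r"^sc-[A-Za-z0-9]{4,}$"),  # e.g. sc-bdVaJa
]


def stable_classes(class_attr):
    """Return the class tokens that do not look like generated hashes."""
    return [
        token
        for token in class_attr.split()
        if not any(p.match(token) for p in GENERATED_CLASS_PATTERNS)
    ]


print(stable_classes("event-card css-1q2w3e sc-bdVaJa title"))
```

A selector built only from the surviving names has a better chance of still matching after the site's next build.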
cdolan
What have you had success doing with this? Curious to test it
throwup238
I mostly use it to aggregate event calendars for all the concert/sport/etc venues, meetups, and clubs in my area and do some other scraping tasks. I host a little wrapper around llm-scraper on a DigitalOcean droplet that I call from Val.town scripts
I only check most places once a week so I use the LLM to do the scraping but there are a few cases where I have to scrape thousands of pages very frequently so I use the more deterministic script it generates instead.
TheTaytay
Nice! Thanks!
cdolan
We’ve built one internally using browser-use to generate playwright code
Works ok. Not as automated as I’d like
nicman23
they are all quite bad
gitroom
pretty cool seeing people still tweak their own scraping tools, but the cat and mouse game never ends huh - you think the web ever gets more open again or just keeps locking down?
tommica
Well, it won't get more open by us just bitching here and doing nothing else
tengbretson
Anyone have any experience webscraping from a Starlink IP? My assumption is you could stay under the radar due to cg nat, but it's not exactly something I want to be the first to find out about.
lyjackal
With mobile 4G USB sticks you can usually rotate your IP address by reconnecting. I tried it on a Pi and it was inconsistent. This was just with some random test mobile plan from a rando carrier renting off Verizon, I think
dewey
Seems much easier to just pay for a rotating proxy pool.
3abiton
> extract data from websites with precision using XPath selectors.
I've used XPath for crawling with selenium, and it used to be my favorite way, but it turned out to be quite unreliable if you don't combine it with other selectors, as certain websites are really badly designed and have no good patterns. So what's the added value over pure selenium?
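A small sketch of what combining selectors can look like: try a list of XPath expressions in order and take the first that matches. It uses the standard library's ElementTree, which only supports a subset of XPath and needs well-formed markup, so real pages would need a proper HTML parser or the browser's own evaluator. The page snippet and selectors are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment
PAGE = """
<html><body>
  <div class="listing">
    <span data-field="price">19.99</span>
  </div>
</body></html>
"""


def first_match(root, selectors):
    """Return (selector, text) for the first selector that finds non-empty text."""
    for selector in selectors:
        node = root.find(selector)
        if node is not None and node.text:
            return selector, node.text.strip()
    return None, None


root = ET.fromstring(PAGE)
selectors = [
    ".//span[@id='price']",           # preferred, but absent on this page
    ".//span[@data-field='price']",   # fallback on a data attribute
    ".//div[@class='listing']/span",  # last resort: structural position
]
matched_selector, price = first_match(root, selectors)
print(matched_selector, price)
```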
vivzkestrel
does this implement a rotating proxy IP address service?
iSloth
Interesting, wish it had markdown output like firecrawl for embedding/llm use cases
_QrE
Is there a reason for using Selenium over something like Playwright? I haven't had very many positive experiences with selenium, and playwright I found is easier to use and more flexible.
Also, for stuff like this:
`modified_value = original_value.replace("HeadlessChrome", "Chrome")`
There's quite a few ways to figure out that a browser is a bot, and I don't think replacing a few values like this does much. Not asking you to reveal any tricks, just saying that if you're using something like Playwright, you can e.g. run scripts in the browser to adjust your fingerprint more easily.
windexh8er
If you're a fan of Playwright check out Crawlee [0]. I've used it for a few small projects and it's been faster for me to get what I've needed done.
kristopolous
It's by apify which is an interesting community
jpyles
I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew that playwright was a thing.
I am looking to refactor a lot of this, and switching over to playwright is a high priority, using something like camoufox for scraping, instead of just chromium.
Most of my work on this the past month has been simple additions that are nice to haves
michaeljx
I was in a similar boat with my scrapers. Started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. Spent a month or so swapping the two, which was well worth it. Cleaner API, async support.
nkozyra
Playwright was miles ahead of selenium but what I think is really overlooked is chromedp
jpyles
Luckily, I have some experience with playwright, so swapping shouldn't take me too long.
Currently working on a PR to swap over
jpyles
With the custom headers, you can actually trick a lot of sites with bot protection into letting you load their pages (even big sites like youtube, which I have had success with)
dotancohen
How do you work around pop-ups for newsletters and such? Look at the BBC for a good example.
anxman
Pack ad blockers into your containers. They can be loaded into Chrome and help immensely in suppressing popovers while crawling.
throwaway81523
Last time I looked, Selenium was able to use Firefox. IDK about Playwright, but Puppeteer was Chrome-only.
stuffoverflow
Playwright supports Firefox, chromium and webkit
Funny, I saw this HN headline just after banning another scraper's IP range
You're welcome to scrape my sites but please do it ethically. Idk how to define that but some examples of things I consider not cool:
- Scraping without a contact method, or at least some unique identifier (like your project's codename), in the user agent string.
This is common practice, see e.g.: <https://en.wikipedia.org/wiki/User-Agent_header#Format_for_a...>. Many sites mention in public API guidelines to include an email address so you can be contacted in case of problems. If you don't include this and you're causing trouble, all I can do is ban your IP address altogether (or entire ranges: if you hop between several IPs I'll have to assume you have access to the whole range). Nobody likes IP bans: you have to get a new IP, your provider has a burned IP address, the next customer runs into issues... don't be this person, include an identifier.
- Timing out the request after a few seconds.
Some pages on my site involve number crunching and take 20 seconds to load. I could add complexity to do this async instead, but, by having it live, the regular users get the latest info and they know to just wait a few seconds and everybody is happy. Even the scrapers can get the info, I'm fine computing those pages for you. But if you ask for me to do work and then walk away, that's just rude. It shows up in my logs as HTTP status 499 and I'll ban scrapers that I notice doing this regularly
- Ignoring robots.txt.
I have exactly 1 entry in there, and that's a caching proxy for another site that is struggling with load. If you ignore the robots file and just crawl the thing from A to Z at a high rate, that causes a lot of requests to the upstream site for updating stale caches. You can obviously expect a ban because it's again just a waste of resources
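The three points above could look roughly like this on the scraper's side. This is a sketch, not a complete crawler: the user agent string, contact address, robots rules, and URLs are all hypothetical, and the robots.txt lines are parsed from a literal instead of being downloaded.

```python
import urllib.request
import urllib.robotparser

# Identify the project and give a contact method (placeholder values)
USER_AGENT = "example-research-crawler/0.1 (+mailto:ops@example.org)"
# Be patient with slow pages instead of giving up after a few seconds
REQUEST_TIMEOUT_SECONDS = 60

robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /cache-proxy/",  # made-up path standing in for the one entry
])


def build_request(url):
    """Return a Request for url, or None when robots.txt disallows it."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


allowed = build_request("https://example.org/stats/records")
blocked = build_request("https://example.org/cache-proxy/page")
print(allowed.get_header("User-agent"), blocked)
# A real fetch would pass the timeout explicitly, for example:
# urllib.request.urlopen(allowed, timeout=REQUEST_TIMEOUT_SECONDS)
```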