It seems that OpenAI is scraping [certificate transparency] logs
32 comments
December 15, 2025
ekr____
With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.
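A minimal sketch of such a feed, assuming an RFC 6962 log endpoint (the Google Argon URL below is only an illustrative choice) and the requests and cryptography packages; it decodes only x509_entry leaves and skips precert leaves for brevity, so real tooling would need to handle both:

import base64
import struct

import requests
from cryptography import x509
from cryptography.x509.oid import NameOID

# Illustrative log URL; any RFC 6962 log exposes the same endpoints.
LOG = "https://ct.googleapis.com/logs/us1/argon2025h1"

def new_domains(start: int, end: int, seen: set) -> list:
    """Fetch log entries [start, end] and return domain names not seen before."""
    resp = requests.get(f"{LOG}/ct/v1/get-entries",
                        params={"start": start, "end": end}, timeout=30)
    resp.raise_for_status()
    fresh = []
    for entry in resp.json()["entries"]:
        leaf = base64.b64decode(entry["leaf_input"])
        # MerkleTreeLeaf layout: version (1 byte), leaf_type (1 byte),
        # timestamp (8 bytes), entry_type (2 bytes), then the signed entry.
        (entry_type,) = struct.unpack(">H", leaf[10:12])
        if entry_type != 0:  # 0 = x509_entry; 1 = precert_entry (skipped in this sketch)
            continue
        cert_len = int.from_bytes(leaf[12:15], "big")
        cert = x509.load_der_x509_certificate(leaf[15:15 + cert_len])
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName).value
            names = san.get_values_for_type(x509.DNSName)
        except x509.ExtensionNotFound:
            names = [a.value for a in cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)]
        for name in names:
            if name not in seen:
                seen.add(name)
                fresh.append(name)
    return fresh

Polling from the tree size reported by /ct/v1/get-sth and persisting the seen set would give roughly the deduplicated "new names only" feed described above.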
H8crilA
For those who have never looked at the CT logs: https://crt.sh/?q=ycombinator.com
(the site may occasionally fail to load)
pavel_lishin
What's the yawn for?
jfindper
It implies that this is boring and not article/post-worthy (which I agree with).
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
moralestapia
Because it's hardly news in its context.
xpe
Presumably this is well-known among people that already know about this.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
irishcoffee
Everyone does it, it’s no big deal. “Yes officer I was speeding, so was everyone else!”
Gross.
jfindper
The intended purpose of certificate transparency logs is to be viewed by others!
Perhaps you should save your "gross" judgement for when you better understand what's happening?
tsimionescu
The whole point of the CT logs is to be a public list of all domains which have TLS certs issued by the Web PKI. People are reading this list. I really don't see what is either surprising or in any way problematic in doing so.
edvinbesic
You are implying that a law is being broken, but isn't this the equivalent of going to city hall to pull public land records?
formerly_proven
The whole point of CT logs is to make issuance of certificates in the public WebPKI… public.
bombcar
If you want to somewhat avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...).
Then all they know is the main domain, and you can somewhat hide in obscurity.
lysace
Unfortunately they are a bit more bothersome to automate (depending on your DNS provider/setup) because of the DNS-01 (TXT record) validation requirement.
jsheard
Yep, but next year they intend to launch a new challenge type which doesn't require write access to your DNS records every time it renews. Just add a public key to your DNS once and you're done.
Aurornis
This could be OpenAI, or it could be another company using their header pattern.
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
jsheard
In this case it actually is OpenAI: the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).
https://openai.com/searchbot.json
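If you want to repeat that check yourself, here is a small standard-library sketch; it assumes searchbot.json uses the same prefixes/ipv4Prefix layout as the snippet quoted further down the thread:

import ipaddress
import json
import urllib.request

def is_openai_searchbot(ip: str) -> bool:
    """Return True if `ip` falls inside one of the published searchbot ranges."""
    with urllib.request.urlopen("https://openai.com/searchbot.json", timeout=10) as resp:
        prefixes = json.load(resp).get("prefixes", [])
    addr = ipaddress.ip_address(ip)
    for entry in prefixes:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr and addr in ipaddress.ip_network(cidr):
            return True
    return False

print(is_openai_searchbot("74.7.175.182"))  # the IP from the submission; expected True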
I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether a client is faking it, so it just hands ammo to more advanced filters which do check.
$ curl -I https://www.cloudflare.com
HTTP/2 200
$ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403
Aurornis
Thanks for looking it up!
basilikum
They definitely do. Before this comment, CT logs – aside from DNS queries – were the only way to know about https://onion.basilikum.monster, and you have to send the hostname in the SNI; otherwise you get a different certificate back.
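A quick way to see that SNI behaviour for yourself is a rough sketch like the following, using Python's standard ssl module plus the cryptography package; verification is deliberately disabled so the mismatching certificate can still be inspected, and the second SNI value below is just a placeholder:

import socket
import ssl
from cryptography import x509

def served_cert_subject(host: str, sni: str) -> str:
    """Connect to host:443, present `sni` in the handshake, return the certificate subject."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # accept whatever certificate comes back
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni) as tls:
            der = tls.getpeercert(binary_form=True)
    return x509.load_der_x509_certificate(der).subject.rfc4514_string()

print(served_cert_subject("onion.basilikum.monster", "onion.basilikum.monster"))
print(served_cert_subject("onion.basilikum.monster", "example.invalid"))  # placeholder SNI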
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
throwaway613745
OpenAI is scraping everything that is publicly accessible. Everything.
Aachen
Yet they provide the user agents and IP address ranges they scrape from, and they say they respect robots.txt.
I run a web server and so see a lot of scrapers, but OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even have that ethical standard, so I'd not say "OpenAI scrapes everything they can access. Everything" without qualification, as that doesn't seem to be true, at least not until someone puts a file behind a robots deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
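For reference, opting out in robots.txt looks roughly like this. OAI-SearchBot is the token from the user agent quoted elsewhere in the thread, and GPTBot is OpenAI's training crawler token; treat this as an illustrative sketch and check OpenAI's documentation for the current list:

User-agent: OAI-SearchBot
Disallow: /

User-agent: GPTBot
Disallow: /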
jcims
Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?
>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
snowwrestler
Right. Crawler user agent strings in general tend to include all sorts of legacy stuff for compatibility.
This actually is a well-behaved crawler user agent because it identifies itself at the end.
Hrun0
Yes, it is very common to change your user agent for web scraping, mainly because there are websites which will block you based on that alone.
benjojo12
The IP address this comes from is in an OpenAI search bot range:
> "ipv4Prefix": "74.7.175.128/25"
_pdp_
I wonder if this can be used to contaminate OpenAI search indexes?
xpe
Meta: People are quite excellent at jumping to conclusions or assuming their POV is the only one.
Consider this simplified scenario.
- X happened
- Person P says "Ah, *X* happened."
- Person Q *interprets* this in a particular way and says "Stop saying X is BAD!"
- Person R, who already knows about X (and is indifferent to what others notice or might know or be interested in), says "(yawn)".
This failure mode is incredibly common. And preventable. What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.
See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions.
mxlje
So? It's public information and a somewhat easily consumable stream of websites to scrape; if my job were to scrape the entire internet I'd probably start there, too.
gmerc
Let's prompt inject it
drwhyandhow
This has long been the case! I think their whole business model is based on scraping lol
Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.