
It seems that OpenAI is scraping [certificate transparency] logs

827a

Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.

ekr____

With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.

827a

These exist for apex domains; the real use-case is subdomains.

ekr____

Sure, but the subdomains will be duplicated for the same reasons.
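A minimal sketch of that dedup step, assuming you already have a stream of hostnames (one per line, e.g. tailed out of a CT monitor); the wildcard handling is a naive placeholder:

  import sys

  # Read one hostname per line and emit each name only the first time it is
  # seen. A real feed would persist `seen` to disk rather than keep it in memory.
  seen = set()
  for line in sys.stdin:
      name = line.strip().lower().lstrip("*.")  # treat *.example.com like example.com
      if name and name not in seen:
          seen.add(name)
          print(name)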

H8crilA

For those who have never looked at the CT logs: https://crt.sh/?q=ycombinator.com

(the site may occasionally fail to load)
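If you want the same data programmatically, crt.sh also exposes a JSON output mode; a rough sketch (the endpoint is best-effort, so expect occasional timeouts, and the User-Agent string here is just a placeholder):

  import json
  import urllib.request

  # Fetch crt.sh's JSON output for a domain and list the unique names seen in
  # its certificates. %25 is the URL-encoded '%' wildcard, so this also matches subdomains.
  url = "https://crt.sh/?q=%25.ycombinator.com&output=json"
  req = urllib.request.Request(url, headers={"User-Agent": "ct-lookup-example"})
  with urllib.request.urlopen(req, timeout=30) as resp:
      entries = json.load(resp)

  names = set()
  for entry in entries:
      # name_value can hold several SAN entries separated by newlines
      names.update(entry["name_value"].lower().splitlines())

  for name in sorted(names):
      print(name)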

pavel_lishin

What's the yawn for?

jfindper

It implies that this is boring and not article/post-worthy (which I agree with).

Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.

moralestapia

Because it's hardly news in its context.

xpe

Presumably this is well-known among people that already know about this.

P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]

> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.

[1]: https://www.nature.com/articles/s41562-023-01719-1

irishcoffee

Everyone does it, it’s no big deal. “Yes officer I was speeding, so was everyone else!”

Gross.

jfindper

The intended purpose of certificate transparency logs is to be viewed by others!

Perhaps you should save your "gross" judgement for when you better understand what's happening?

tsimionescu

The whole point of the CT logs is to be a public list of all domains which have TLS certs issued by the Web PKI. People are reading this list. I really don't see what is either surprising or in any way problematic in doing so.

edvinbesic

You are implying that a law is being broken, but isn't this the equivalent of going to city hall to pull public land records?

formerly_proven

The whole point of CT logs is to make issuance of certificates in the public WebPKI… public.

bombcar

If you want to somewhat avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...)

Then all they know is the main domain, and you can somewhat hide in obscurity.

lysace

Unfortunately they are a bit more bothersome to automate (depending on your DNS provider/setup) because wildcard issuance requires DNS-01 validation, i.e. write access to your DNS records (or a CNAME delegating the challenge to a zone you can automate).

jsheard

Yep, but next year they intend to launch a new challenge type which doesn't require write access to your DNS records every time it renews. Just add a public key to your DNS once and you're done.

Aurornis

This could be OpenAI, or it could be another company using their header pattern.

It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.

Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.

EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.

jsheard

In this case it is actually OpenAI, the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).

https://openai.com/searchbot.json
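If you want to do that check yourself, here is a sketch; it assumes the file keeps the Googlebot-style "prefixes"/"ipv4Prefix" layout quoted elsewhere in this thread:

  import ipaddress
  import json
  import urllib.request

  # Fetch OpenAI's published crawler ranges and test whether a logged client IP
  # falls inside any of them. The top-level "prefixes" key is an assumption
  # based on the Googlebot-style layout of the file.
  client_ip = ipaddress.ip_address("74.7.175.182")

  with urllib.request.urlopen("https://openai.com/searchbot.json", timeout=15) as resp:
      data = json.load(resp)

  ranges = [
      ipaddress.ip_network(prefix[key])
      for prefix in data.get("prefixes", [])
      for key in ("ipv4Prefix", "ipv6Prefix")
      if key in prefix
  ]

  print(any(client_ip in net for net in ranges))  # True if the IP really is OpenAI's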

I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether a request is faking its identity, so it just hands ammo to the more advanced filters which do check.

  $ curl -I https://www.cloudflare.com
  HTTP/2 200

  $ curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
  HTTP/2 403
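For crawlers that support it there is also forward-confirmed reverse DNS, which is how Google documents verifying Googlebot; a rough Python sketch:

  import socket

  # Reverse-resolve the client IP, check the hostname is under a domain the
  # crawler operator controls, then forward-resolve it to confirm it maps back.
  def is_real_googlebot(ip: str) -> bool:
      try:
          hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
      except socket.herror:
          return False
      if not hostname.endswith((".googlebot.com", ".google.com")):
          return False
      try:
          return socket.gethostbyname(hostname) == ip      # forward-confirm
      except socket.gaierror:
          return False

  print(is_real_googlebot("66.249.66.1"))  # substitute an IP from your own logs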

Aurornis

Thanks for looking it up!

basilikum

They definitely do. Before this comment, CT logs – aside from DNS queries – were the only way to know about https://onion.basilikum.monster, and you have to send the hostname in the SNI, otherwise you get a different certificate back.

Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.

That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.

throwaway613745

OpenAI is scraping everything that is publicly accessible. Everything.

Aachen

Yet they provide the user agents and IP address ranges they scrape from, and say they respect robots.txt.

I run a web server and so see a lot of scrapers, and OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even have that ethics standard, so I wouldn't say "OpenAI scrapes everything they can access. Everything." without qualification; that doesn't seem to be true, at least not until someone puts a file behind a robots.txt deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.

jcims

Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?

>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;

snowwrestler

Right. Crawler user agent strings in general tend to include all sorts of legacy stuff for compatibility.

This actually is a well-behaved crawler user agent because it identifies itself at the end.

Hrun0

Yes, it is very common to change your user agent for web scraping, mainly because there are websites that will block you based on that alone.

benjojo12

The IP address this comes from is in an OpenAI search bot range:

> "ipv4Prefix": "74.7.175.128/25"

from https://openai.com/searchbot.json

_pdp_

I wonder if this can be used to contaminate OpenAI search indexes?

xpe

Meta: People are quite excellent at jumping to conclusions or assuming their POV is the only one.

Consider this simplified scenario.

    - X happened
    - Person P says "Ah, *X* happened."
    - Person Q *interprets* this in a particular way
      and says "Stop saying X is BAD!"
    - Person R, who already knows about X...
      (and indifferent to what others notice
       or might know or be interested in)
      ...says "(yawn)".

This failure mode is incredibly common. And preventable. What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions

mxlje

So? It’s public information and a somewhat easily consumable stream of websites to scrape; if my job were to scrape the entire internet, I’d probably start there, too.

gmerc

Let's prompt inject it

drwhyandhow

This has long been the case! I think their whole business model is based on scraping lol