Launch HN: Exa (YC S21) – The web as a database

76 comments

·May 6, 2025

Hey HN! We’re Will and Jeff from Exa (https://exa.ai). We recently launched Exa Websets, an embeddings-powered search engine designed to return exactly what you’re asking for. You can get precise results for complex queries like “all startups working on open-source developer tools based in SF, founded 2021-2025”. Demo here - https://youtu.be/Unt8hJmCxd4

We started working on Exa because we were frustrated that while LLM state-of-the-art is advancing every week, Google has gotten worse over time. The Internet used to feel like a magical information portal, but it doesn’t feel that way anymore when you’re constantly being pushed towards SEO-optimized clickbait.

Websets is a step in the opposite direction. For every search, we perform dozens of embedding searches over Exa’s vector database of the web to find good search candidates, then we run agentic workflows on each result to verify they match exactly what you asked for.
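The two-stage flow described above (many embedding searches to gather candidates, then a per-result verification pass) can be sketched roughly as follows. This is a toy illustration, not Exa's implementation: the 2-dimensional vectors, the `index` contents, and the `verify` callback (a stand-in for an LLM-backed check) are all assumptions made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vecs, index, k):
    """Union of the top-k hits across several query embeddings."""
    candidates = {}
    for qv in query_vecs:
        ranked = sorted(index, key=lambda d: cosine(qv, d["vec"]), reverse=True)
        for doc in ranked[:k]:
            candidates[doc["url"]] = doc  # dedupe by URL
    return list(candidates.values())

def search(query_vecs, index, verify, k=2):
    """Stage 1: cheap vector retrieval. Stage 2: expensive per-result check."""
    return [doc for doc in retrieve(query_vecs, index, k) if verify(doc)]

# Hypothetical toy index: hand-written vectors and founding years.
index = [
    {"url": "a", "vec": [1.0, 0.0], "founded": 2022},
    {"url": "b", "vec": [0.9, 0.1], "founded": 2015},
    {"url": "c", "vec": [0.0, 1.0], "founded": 2023},
    {"url": "d", "vec": [-1.0, 0.0], "founded": 2022},
]
query_vecs = [[1.0, 0.0], [0.8, 0.2]]  # several phrasings of one query

# Stand-in for the LLM verification step: keep entities founded 2021 or later.
results = search(query_vecs, index, verify=lambda d: d["founded"] >= 2021)
result_urls = [d["url"] for d in results]
```

Here "c" is never verified because it is not close to either query vector, and "b" is retrieved but rejected by the check, which is the division of labor the post describes.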

Websets results are good for two reasons. First, we train custom embedding models for our main search algorithm, instead of relying on typical keyword-matching search algorithms. Our embedding models are trained specifically to return exactly the type of entity you ask for. In practice, that means if you search “startups working in nanotech”, keyword-based search engines return listicles about nanotech startups, because these listicles match the keywords in the query. In contrast, our embedding models return actual startup homepages, because these startup homepages match the meaning of the query.
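To make the keyword-matching half of that claim concrete, here is a minimal sketch of a naive token-overlap scorer. The two page snippets are invented for the example, and real keyword engines (BM25 and similar) are more sophisticated, but the bias is the same in kind: the page that repeats the query words wins, even though the other page is the actual entity being asked for.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query, text):
    """Fraction of query tokens that appear in the text."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q)

query = "startups working in nanotech"

# Invented snippets: a listicle title and a startup homepage tagline.
pages = {
    "listicle": "Top 10 nanotech startups working in 2025",
    "homepage": "We build molecular-scale sensors for medical diagnostics",
}

scores = {name: keyword_score(query, text) for name, text in pages.items()}
best = max(scores, key=scores.get)
```

The homepage shares no tokens with the query, so a purely lexical ranker cannot surface it; an embedding model would have to place it near the query by meaning instead.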

The second is that LLMs provide the last-mile intelligence needed to verify every result. Each result and piece of data is backed with supporting references that we used to validate that the result is actually a match for your search criteria. That’s why Websets can take minutes or even hours to run, depending on your query and how many results you ask for. For valuable search queries, we think this is worth it.

Also notably, Websets are tables, not lists. You can add “enrichment” columns to find more information about each result, like “# of employees” or “does author have blog?”, and the cells asynchronously load in. This table format hopefully makes the web feel more like a database.
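One way to picture cells that "asynchronously load in" is a set of concurrent tasks, one per cell, each writing into the table when it finishes. The sketch below assumes a hypothetical `enrich_cell` coroutine standing in for whatever slow lookup fills a cell; it is not the Websets API.

```python
import asyncio

async def enrich_cell(row, column):
    """Hypothetical slow lookup for one cell (simulated with a short sleep)."""
    await asyncio.sleep(0.01)
    return f"{column} for {row['name']}"

async def enrich_table(rows, columns, on_cell):
    """Fill every (row, column) cell concurrently, reporting each as it lands."""
    async def fill(row, column):
        row[column] = await enrich_cell(row, column)
        on_cell(row["name"], column)  # e.g. tell the UI to redraw this cell

    await asyncio.gather(*(fill(r, c) for r in rows for c in columns))
    return rows

rows = [{"name": "Acme"}, {"name": "Globex"}]
loaded = []
table = asyncio.run(
    enrich_table(rows, ["# of employees"], lambda name, col: loaded.append((name, col)))
)
```

The `on_cell` callback is where a UI would show progress per cell rather than waiting for the whole table.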

A few examples of searches that work with Websets:

- “Math blogs created by teachers from outside the US”: https://websets.exa.ai/cma1oz9xf007sis0ipzxgbamn

- "research paper about ways to avoid the O(n^2) attention problem in transformers, where one of the first author's first name starts with "A","B", "S", or "T", and it was written between 2018 and 2022”: https://websets.exa.ai/cm7dpml8c001ylnymum4sp11h

- “US based healthcare companies, with over 100 employees and a technical founder": https://websets.exa.ai/cm6lc0dlk004ilecmzej76qx2

- “all software engineers in the Bay Area, with experience in startups, who know Rust and have published technical content before”: https://youtu.be/knjrlm1aibQ

You can try it at https://websets.exa.ai/ and API docs are at https://docs.exa.ai/websets. We’d love to hear your feedback!


AznHisoka

I searched for 'data providers that start with the letter R that sell job postings data', and it's been 15 minutes and it barely verified the first row.

But if it filtered it first to "start with the letter R", it would only have to look at perhaps 5% of the results it's trying to verify!

So it's doing needless verification of results that will be thrown out by another filter that should've been applied first!
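The ordering point can be sketched as follows: sort the filters by estimated cost and short-circuit on the first failure, so the expensive check only runs on rows that survive the cheap one. The provider names, the cost numbers, and the `"Data" in name` stand-in for an expensive verification are all invented for the example; how Websets actually orders its checks is not stated in the thread.

```python
def run_filters(candidates, filters):
    """Apply filters cheapest-first and stop at the first one that fails."""
    ordered = sorted(filters, key=lambda f: f["cost"])
    calls = {f["name"]: 0 for f in filters}
    kept = []
    for item in candidates:
        passed = True
        for f in ordered:
            calls[f["name"]] += 1
            if not f["check"](item):
                passed = False
                break  # skip the remaining (more expensive) filters
        if passed:
            kept.append(item)
    return kept, calls

# Invented provider names, not real companies.
candidates = ["Rho Jobs Data", "Alpha Listings", "Beta Feeds", "Riverbank Hiring Data"]

filters = [
    # Stand-in for a slow, LLM-backed verification.
    {"name": "sells_job_postings", "cost": 1000, "check": lambda n: "Data" in n},
    # Trivial string check.
    {"name": "starts_with_r", "cost": 1, "check": lambda n: n.startswith("R")},
]

kept, calls = run_filters(candidates, filters)
```

With four candidates, the cheap check runs four times and the expensive one only twice.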

xp84

This is super cool! It took a while, but did a great job of evaluating the results, and the airtable-like results UI feels great.

Congrats on your launch. With the natural way this lends itself to comparison shopping, this is an amazing tool for people trying to find "the best X for me", whether that's a TV, a school, etc. So much of the content you find on Google when trying to answer that type of query is designed to trick, bamboozle, and hide the facts you might use to answer the question (but most of all to get you to click affiliate links).

byearthithatius

I was so excited for this, but sadly it doesn't work at all, not even UI feedback for the error:(

The UI showed literally no change. So I checked and the console shows:

```
Try: 14 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 15 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 16 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 17 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 18 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 19 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Try: 20 Not Found 681-7df1b139fa2dc9f0.js:14:3379
Gave up after 10 seconds. 681-7df1b139fa2dc9f0.js:14:3379
filteredSuggestions Array(3) [ {…}, {…}, {…} ] 681-7df1b139fa2dc9f0.js:14:3379
```

Also your table doesn't fit in the viewport so I can't see the results.

Firefox Ubuntu.

tibbar

I also thought the UX had silently died on me, but over the course of a few hours, results slowly rolled in. And they were pretty good, for what it's worth! It's clear they have far more demand than supply, or at least more than can reasonably be served for free.

pilingual

When OpenAI was rumored to acquire Windsurf last week I went to their site and switched languages. When I tried to switch back it got into a weird state and didn't display the original language. Not sure what to think of that other than vibe coding may need a little more oversight. (Who is working on AI QA? Winning pickaxe and shovel business right there.)

joshstrange

I think it might be a good idea to give some kind of indication that work is being done in the background (or perhaps mine stalled out?).

The initial search/experience is good but then I got dumped here [0] and it's not clear to me if things are still happening or if it broke (it's been at least 5 min with no UI updates).

I can't see the full results yet but this is very interesting and a task I ask OpenAI's Deep Research to attempt periodically. It makes a good show of doing the work but the results are not great IMHO (when asking it to generate lists/tables of data like this). I can see this tool being incredibly useful for lead generation (which is how I am testing it out).

[0] https://cs.joshstrange.com/dySqK1mb

hubraumhugo

I think you guys nailed the "selling shovels during a gold rush" as the biggest issue with LLMs currently is their reliability/hallucinations, not their capabilities. If I can use websets to back up LLM responses through your API, that's super useful.

Since you were part of YC 21, could you share a bit about your pivots/product iterations you went through over the last 4 years?

willbryk

The mission of Exa has always been to build much better web search. The evolution has been:

- 2022: Consumer-facing embeddings search (back when we were known as Metaphor)

- 2023: Web search for AIs - once the AI ecosystem heated up, we made a business out of web search + crawling API. This is still our primary business.

- Now: Websets, a useful product built on top of our search tech

If you're curious, our company right now is fully devoted to:

1. Dramatically improving Websets quality

2. Building the best general search engine in the world

gavinward

It must be heartening for a startup trying to build the best general search engine in the world to know that Google has absolutely no interest in competing with you.

willbryk

Because Google makes money from ads, they're not actually optimized to build the best general search engine in the world, they're optimized to build the search engine that makes the most from ads, which is correlated with being a good search engine but not perfectly aligned. Our business model (paying directly for the search) incentivizes us to try to return the highest quality results, without any bias toward making money from ads. It also enables us to do things like pour a ton of compute/resources into a query to get the best possible results we can find, because someone would pay us a lot for that, and that's hard to do under an ads-based model.

theamk

Did my favorite search query, and the results were pretty bad, as expected:

"robotics servo motors with two-directional control for under $100"

1. https://mjbots.com/ - their motors are $1369. FAIL.

2. https://www.pololu.com/ - this is a huge store, but they do have some motors like that. Pass, but I wish it linked to the specific page and not the top-level one.

3. dh-robotics.com - no prices, but some of their products sell on the open market for a few thousand dollars. Likely a fail as well.

4. https://www.robotarticulation.com/ - The product is not for sale (early beta), and it looks likely much more than $1K. FAIL.

5. https://www.lynxmotion.com/ - another huge store, most two-directional motors are expensive but there are some under $100... Pass, but I wish it linked to the specific page and not the top-level one.

85392_school

It sounds like they have yet to focus on products. Loosely quoting https://news.ycombinator.com/item?id=43907634:

> So the search should work best for people, companies, papers, high quality written content.

> Types of searches Websets doesn't currently do well at: products, content that requires authentication/permissions to access, and non-English content

mbeavitt

This is super cool. You provide examples of “searches that work” - can you give an idea of the limitations here? What kind of searches won’t work?

willbryk

We're a startup, so most of our resources go towards the use cases that our users care most about. So the search should work best for people, companies, papers, and high-quality written content (e.g., blogs, news). It should work well at more than just those (try GitHub repo search, it's quite good :D), but those are the best supported.

Types of searches Websets doesn't currently do well at:

- Products (e.g., ecommerce sites)

- Content that requires authentication/permissions to access

- Non-English content

Some of the above are on our roadmap, and let us know if there's some type of data you'd like us to support!

colkassad

Geospatial data would be great. This stuff is notoriously annoying to search for. For example:

"Give me a list of free imagery service endpoints I can use in a maplibre style sheet. Include information such as name, description, service endpoint, service type, extent (global/regional)."

willbryk

This might be possible if you specify geospatial location as an enriched column. The visualization of it as a map though is not supported in the UI, but can be built by giving an LLM access to the Websets API

dbuxton

Hey! Congrats on the launch. I just signed up for a trial account and I’m pretty impressed with the search API (haven’t used websets yet but looks cool).

Our experimental use case is enabling quick and dirty integration of web-based docs into an employee service agentic chatbot - lots of the questions are around “how do I max out my 401k”, which connects to internal information, but some are more like “how do I link a calendar to calendly”.

The one thing I’d love to have in the search product is a cruft cleaner for the results of web queries. Where you have cached the data presumably this wouldn’t add much overhead. Reduces what you have to feed to the LLM downstream and might improve the embeddings performance.
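For readers wondering what a basic "cruft cleaner" might look like on the client side, here is a minimal sketch using only Python's standard `html.parser`. The set of tags treated as cruft and the sample HTML are assumptions for illustration; this is not how Exa parses pages, and real pages need much more robust handling.

```python
from html.parser import HTMLParser

class CruftStripper(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    # Assumed list of tags whose contents are rarely useful to an LLM.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped container.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def clean_html(html):
    parser = CruftStripper()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = (
    "<html><head><style>body{}</style><script>track()</script></head>"
    "<body><nav>Home | Pricing</nav>"
    "<main><h1>Link a calendar</h1><p>Open Settings, then choose Calendar.</p></main>"
    "<footer>Copyright Example</footer></body></html>"
)
cleaned = clean_html(sample)
```

Stripping navigation, scripts, and footers before embedding or prompting shrinks the downstream context, which is the benefit the comment is after.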

willbryk

By cruft cleaner, do you mean cleaning the HTML well? Right now, we do 2 things to help with that: a pretty robust parsing stack, and a "summaries" feature that returns an LLM-generated, query-biased text output for every webpage returned.

If something else though, curious.

jackienotchan

AI crawlers have led to a big surge in scraping/crawling activity on the web, and many don't use proper user agents or stick to any of the scraping best practices that the industry has developed over the past two decades (robots.txt, rate limits). This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported on HN (and as I have experienced myself).

Do you have any built-in features that address these issues?
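For reference, the two practices named above (honoring robots.txt and rate limiting) can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot/1.0` user agent, and the default delay are made up for the example, and nothing here describes what Exa's crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ExampleBot/1.0"  # assumed name; a real crawler should identify itself honestly

def load_policy(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def seconds_to_wait(last_fetch, now, delay):
    """How long to sleep before the next request to the same host."""
    return max(0.0, delay - (now - last_fetch))

policy = load_policy(ROBOTS_TXT)

allowed_blog = policy.can_fetch(USER_AGENT, "https://example.com/blog")
allowed_private = policy.can_fetch(USER_AGENT, "https://example.com/private/x")

# Use the site's Crawl-delay if it declares one, else a conservative default.
delay = policy.crawl_delay(USER_AGENT) or 1.0
wait = seconds_to_wait(last_fetch=100.0, now=100.5, delay=delay)
```

In a real crawler the `wait` value would be passed to `time.sleep` per host before the next request.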

antoniojtorres

I work in the adtech ad verification space and this is very true. The surge in content scraping has made things very, very hard in some instances. I can’t really fault the website owners either.

frankramos

The Exa LinkedIn webset is something very innovative. Many current providers make it difficult if not against "Terms of Service" to build a product using their data. The irony is that they simply scraped LinkedIn.

srameshc

So the crawlers feed a database, something classifies and organizes the data stream, and everything is open as one very large dataset. This is an interesting concept.

cobertos

What you describe is the same concept as what https://hash.ai purports to be

willbryk

Yup exactly! And we expose this as a regular search API as well as in the Websets product.

mfrye0

Congrats on the launch!

How do you dedupe entities, like companies and people? I've noticed ChatGPT tends to provide "great" results when asking about different entities, but in reality it just groups similar sounding entities together in its answer.

For example, I asked ChatGPT about a well known startup. It gave me a confident answer about how much they raised, their current status, etc. When looking at the 3 sources they cited though, it was actually 3 different companies that all had similar sounding names that it just grouped together to form its answer.

Basically, how do I trust the output of your system?

liam-hinzman

We find supporting references when evaluating the search criteria / enrichments of each result, and you can view these citations

https://imgur.com/dsGK5dS

mfrye0

Right, I saw that. ChatGPT does the same.

My question is how you can confirm the entity you're referencing in each source is actually the entity you're looking for?

An example I ran into recently is Vast (https://www.vastspace.com/). There are a number of other notable startups named Vast (https://vast.ai/, https://www.vastdata.com/).

I understand Clay, which your Websets product is clearly inspired by, does a fair amount of matching based on domain name or LinkedIn url.

If Websets is doing fuzzy or naive matching, that's okay. I'm just trying to understand the limitations and potential use cases of your current system.
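As an illustration of the domain-based matching mentioned above, here is a minimal sketch that keys each mention by its normalized hostname, so same-named companies on different domains stay separate. It is an assumption about how such matching could work in general, not a description of Clay or Websets; the `www.` stripping is deliberately naive and ignores subdomains and public-suffix rules.

```python
from urllib.parse import urlparse

def domain_key(url):
    """Lowercased hostname with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def group_by_domain(mentions):
    """Group (name, url) mentions by normalized domain."""
    groups = {}
    for name, url in mentions:
        groups.setdefault(domain_key(url), []).append(name)
    return groups

# Three different companies called "Vast", plus a differently written
# variant of the first URL to show normalization.
mentions = [
    ("Vast", "https://www.vastspace.com/"),
    ("Vast", "https://vast.ai/"),
    ("Vast", "https://www.vastdata.com/"),
    ("VAST", "http://WWW.VASTSPACE.COM"),
]

groups = group_by_domain(mentions)
```

Name-only matching would merge all four mentions into one entity; keying on domain keeps three.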

liam-hinzman

Deduplication is mainly driven by LLMs with search results as context. Our entity resolution works well because Exa’s main business is crawling and indexing the web at scale, and we can control how we search across that within Websets.

As far as I know ChatGPT’s search is primarily a wrapper around another company’s search engine, which is why it often feels like it’s just summarizing a page of search results and sometimes hallucinates badly.