
SpaCy: Industrial-Strength Natural Language Processing (NLP) in Python

binarymax

I’ve been a user of SpaCy since 2016. I haven’t touched it in years and I just picked it up again to develop a new metric for RAG using part of speech coverage.

The API is one of the best ever, and really set the bar high for language tooling.

I’m glad it’s still around and getting updates. I had a bit of trouble integrating it with uv, but nothing too bad.

Thanks to the Explosion team for making such an amazing project and keeping it going all these years.

To the new “AI” people in the room: check out SpaCy, and see how well it works and how fast it chews through text. You might find yourself in a situation where you don’t need to send your data to OpenAI for some small things.

Edit: I almost forgot to add this little nugget of history: one of Hugging Face’s first projects was a SpaCy extension for coreference resolution, built before their breakthrough with transformers: https://github.com/huggingface/neuralcoref


patrickhogan1

SpaCy was my go-to library for NER before GPT-3+. It was 10x better than regex (though you could also include regex within your pipelines).
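A minimal sketch of the "regex within your pipelines" point (the labels and patterns here are hypothetical, not from the thread): spaCy's EntityRuler does rule-based NER in a blank pipeline, no trained model needed, and token patterns can embed regexes that respect token boundaries.

```python
import spacy

# Blank English pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Multi-token pattern matched case-insensitively:
    {"label": "ORG", "pattern": [{"LOWER": "acme"}, {"LOWER": "corp"}]},
    # A regex scoped to a single token -- this is the "regex in a pipeline" trick:
    {"label": "YEAR", "pattern": [{"TEXT": {"REGEX": r"^\d{4}$"}}]},
])

doc = nlp("Acme Corp was founded in 2016.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)  # [('Acme Corp', 'ORG'), ('2016', 'YEAR')]
```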

Its annotation tooling was so far ahead. It is still crazy to me that so much of the value in the data annotation space went to Scale AI vs tools like SpaCy that enabled annotation at scale in the enterprise.

bratao

I'm really curious about the history of spaCy. From my PoV: it grew a lot during the pandemic era, hiring a lot of employees. I remember something about raising money for the first time. It was very competitive in NLP tasks. Now it seems that it has scaled back considerably, with a dramatic reduction in employees and a total slowdown of the project. The v4 version looks postponed. It isn't competitive in many tasks anymore (for tasks such as NER, I get better results by fine-tuning a BERT model), and the transformer integration is confusing.

binarymax

I’ve had success with fine-tuning their transformer model. The issue was that there was only one of them per language, compared to Hugging Face where you have a choice of many quality variants that best align with your domain and data.

The SpaCy API is just so nice. I love the ease of iterating over sentences, spans, and tokens and having the enrichment right there. Pipelines are super easy, and patterns are fantastic. It’s just a different use case than BERT.
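To make the "sentences, spans, and tokens" point concrete, here is a small sketch (my own example, using the rule-based sentencizer so no model download is assumed): sentences, tokens, and token-pattern matches all come back as views into the same Doc.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")       # tokenizer only
nlp.add_pipe("sentencizer")   # rule-based sentence boundaries

doc = nlp("spaCy is fast. Pipelines are easy to extend.")

# Sentences and their tokens are right there on the Doc:
for sent in doc.sents:
    print(sent.text, [t.text for t in sent])

# Declarative token patterns via the Matcher:
matcher = Matcher(nlp.vocab)
matcher.add("EASY_TO_X", [[{"LOWER": "easy"}, {"LOWER": "to"}, {"IS_ALPHA": True}]])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # a Span back into the original Doc
```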

skeptrune

SpaCy is criminally underrated. I expect to see it experience a new wave of growth as folks new to AI start to realize all of the language tooling they need to build more reliable "traditional" ML pipelines.

The API surface is well designed, and it's still actively maintained almost 10 years after it initially went public.

chpatrick

Is there any use case for "traditional" NLP in the age of LLMs?

skeptrune

Most definitely! LLMs are amazing tools for generating synthetic datasets that can be used alongside traditional NLP to train things like decision trees with libraries like CatBoost/XGBoost.

I have a search background so learning to rank is always top of mind for me, but there are other places like sentiment analysis, intent detection, and topic classification where it's great too.

binarymax

Some low-hanging fruit: SpaCy makes an amazing chunking tool for preprocessing text for LLMs.
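A hedged sketch of that idea (my own helper, not an official spaCy API): sentence-aware chunking for RAG preprocessing, greedily packing whole sentences into chunks under a character budget. The rule-based sentencizer is used here so no model download is needed; a trained pipeline's parser would give better boundaries.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitting

def chunk_text(text: str, max_chars: int = 100) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sent in nlp(text).sents:
        if current and len(current) + len(sent.text) + 1 > max_chars:
            chunks.append(current)       # budget exceeded: start a new chunk
            current = sent.text
        else:
            current = f"{current} {sent.text}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Unlike fixed-width splitting, this never cuts a sentence in half, which tends to keep retrieval-time embeddings cleaner.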

roadside_picnic

A friend, who also has a background in NLP, was asking me the other day "Is there still even a need for traditional NLP in the age of LLMs?"

This is one of the under-discussed areas of LLMs imho.

For anything that would have required either word2vec embeddings or a tf-idf representation (classification tasks, sentiment analysis, etc.), there are only rare exceptions where it wouldn't just be better to start with a semantic embedding from an LLM.

For NER and similar data extraction tasks, the only advantage of traditional approaches is going to be speed, but my experience in practice is that accuracy is often much more important than speed. Again, I'm not sure why you wouldn't start with an LLM in these cases.

There are still a few remaining use cases (PoS tagging comes to mind), but honestly, if I have a traditional NLP task today, I'm pretty sure I'm going to start with an LLM as my baseline.

giantg2

What are the key differences from other NLP Python libraries?

jihadjihad

Speed (the C in spaCy). A decade ago it was hard to find anything actually production grade for NLP, most packages had an academic bent or were useful for prototyping. SpaCy really changed the game by being able to run performant NLP on standard hardware.