3 spaCy Tricks for Efficient Text Processing and Entity Recognition

# Introduction

Thanks in large part to contemporary large language models, natural language processing (NLP) is a fundamental pillar of modern AI and software systems. You’ll find NLP techniques powering everything from search engines and chatbots to automated customer support routing and entity extraction pipelines. When it comes to production-grade NLP in Python, spaCy is widely regarded as the industry standard. It is designed specifically for production use, offering industrial-strength speed, pre-trained statistical and transformer models, and an intuitive API.

Unfortunately, many developers treat spaCy as a monolithic black box. They load a model, run it on text, and accept the default processing speed and extraction limits. When scaling from local prototypes to millions of documents, these defaults can become computational bottlenecks, causing latency, a bloated memory footprint, and missed domain-specific entities. To build high-performance text processing pipelines, you need to understand how to optimize spaCy’s internal execution flow.

In this article, we’ll explore three essential spaCy tricks that every developer should have in their toolkit to maximize processing speed and improve entity recognition: selective pipeline loading, parallel batch processing, and hybrid rule-based and statistical entity recognition.

Before you begin, make sure you have installed spaCy, as well as its lightweight general-purpose English model:

pip install spacy
python -m spacy download en_core_web_sm

# 1. Selective Pipeline Loading and Disabling Components

By default, when you load a pre-trained spaCy model (e.g. en_core_web_sm), spaCy loads a full NLP pipeline. This pipeline typically includes:

a tokenizer
a part-of-speech tagger (tagger)
a dependency parser (parser)
a lemmatizer (lemmatizer)
an attribute ruler (attribute_ruler)
a named entity recognizer (ner)

Although this rich default feature set is excellent, it comes with substantial computational overhead. If your application only needs named entity recognition (NER), running the dependency parser and lemmatizer wastes CPU cycles and memory. Conversely, if you are just cleaning text and extracting lemmas, running the statistical NER model is highly inefficient. You can tailor the pipeline by excluding components at load time, or by temporarily disabling them during execution with a context manager.

The following naive approach loads and runs every default component on the text, regardless of whether the components’ outputs are actually used:

import spacy
import time

# Load the small English model
nlp = spacy.load("en_core_web_sm")

texts = ["Apple is looking at buying U.K. startup for $1 billion"] * 1000

# Naive execution: runs tagger, parser, lemmatizer, and ner on every doc
# Assume we only care about named entities here
start_time = time.time()
for text in texts:
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

duration_full = time.time() - start_time

print(f"Full pipeline processed 1,000 docs in: {duration_full:.4f} seconds")

Output:

Full pipeline processed 1,000 docs in: 2.8540 seconds

Let’s now optimize execution in two specific ways. First, we’ll exclude heavy, unused components like the dependency parser at load time. Second, we’ll use nlp.select_pipes() to temporarily disable components while processing a specific workload.

import spacy
import time

# Load-time optimization: exclude the heavy parser and tagger from the start
# This reduces initialization time and memory footprint
nlp_optimized = spacy.load("en_core_web_sm", exclude=["parser", "tagger"])

texts = ["Apple is looking at buying U.K. startup for $1 billion"] * 1000

# Context-manager optimization: disable components temporarily
# The parser and tagger are already excluded; here we also switch off attribute_ruler and lemmatizer
start_time = time.time()
with nlp_optimized.select_pipes(disable=["attribute_ruler", "lemmatizer"]):
    for text in texts:
        doc = nlp_optimized(text)
        entities = [(ent.text, ent.label_) for ent in doc.ents]

duration_opt = time.time() - start_time

print(f"Optimized pipeline processed 1,000 docs in: {duration_opt:.4f} seconds")
print(f"Speedup: {duration_full / duration_opt:.2f}x faster!")

Let’s compare the runtime:

Full pipeline processed 1,000 docs in: 2.8739 seconds
Optimized pipeline processed 1,000 docs in: 1.7859 seconds
Speedup: 1.61x faster!

In the optimized example, passing exclude=["parser", "tagger"] to spacy.load() prevents these components from being loaded into memory at all. For the remaining unused components, we took the alternative route and passed disable=["attribute_ruler", "lemmatizer"] to nlp.select_pipes(), which keeps them loaded but skips them for the duration of the with block. The effect is that, when we process text, spaCy skips dependency parsing and part-of-speech tagging, which are computationally expensive, and goes straight to entity recognition. In this run, that yielded a 1.61x speedup with no impact on NER accuracy, and the absolute savings grow with the size of the corpus.
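To make the difference between temporarily disabling a component and dropping it entirely more concrete, here is a minimal sketch. It assumes a blank English pipeline with two rule-based components (sentencizer and entity_ruler) as a stand-in for the pre-trained model, so nothing needs to be downloaded, and it uses nlp.remove_pipe() to play the role that exclude plays at load time. The component choice is illustrative and not part of the benchmark above.

```python
import spacy

# Assumption: a blank pipeline with cheap rule-based components stands in
# for en_core_web_sm so the sketch runs without downloading a model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Temporarily disabled: skipped while the block runs, restored on exit.
with nlp.select_pipes(disable=["sentencizer"]):
    inside = list(nlp.pipe_names)       # active components only
    loaded = list(nlp.component_names)  # active and disabled components
after = list(nlp.pipe_names)

# Removed: gone from the pipeline entirely, comparable to exclude= at load time.
nlp.remove_pipe("sentencizer")
remaining = list(nlp.component_names)

print("inside block:", inside)
print("still loaded:", loaded)
print("after block: ", after)
print("after remove:", remaining)
```

Inside the with block the disabled component is still listed by component_names, which is why it can be restored on exit; after remove_pipe() it cannot be brought back without reloading.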

# 2. High-Throughput Batch Processing with nlp.pipe and Metadata Propagation

If you’re iterating over a large corpus (e.g. pandas DataFrames, database rows, or raw text files), calling the nlp object on individual strings in a loop (e.g. [nlp(text) for text in texts]) is an anti-pattern.

Sequential processing prevents spaCy from optimizing memory buffers, batching operations, and taking advantage of multi-core parallelization. Also, when processing text for database storage or ETL pipelines, you often need to carry metadata (such as a record ID, timestamp, or category) through the NLP pipeline so that you can map the resulting entities back to the correct database rows.

The solution is to use nlp.pipe(). This method processes documents as a stream, buffers them internally into batches, and supports multiprocessing. By setting as_tuples=True, you can feed spaCy (text, context) tuples. It will yield (doc, context) pairs, letting you pass metadata directly through the pipeline.

The following naive approach processes documents sequentially and relies on manual index tracking to align the resulting documents with their database IDs, which is brittle and slow:

import spacy
import time

nlp = spacy.load("en_core_web_sm", exclude=("parser", "tagger"))

# Raw database records with unique IDs
records = [
    {"id": f"DB-REC-{i}", "text": "Google was founded in September 1998 by Larry Page and Sergey Brin."}
    for i in range(1000)
]

# Sequential loop: slow and manually managed metadata
start_time = time.time()
extracted_data = []
for i, record in enumerate(records):
    doc = nlp(record["text"])
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    extracted_data.append({
        "id": record["id"],
        "entities": entities
    })

duration_seq = time.time() - start_time

print(f"Sequential loop processed 1,000 docs in: {duration_seq:.4f} seconds")

Output:

Sequential loop processed 1,000 docs in: 2.7375 seconds

Here, we stream the data using nlp.pipe, taking advantage of batch processing and multi-core parallelization (n_process), while letting the database ID ride along as a context variable:

import spacy
import time

# Keep your imports and definitions global so child processes can see them
nlp = spacy.load("en_core_web_sm", exclude=("parser", "tagger"))

# Wrap the actual execution code in the main block
if __name__ == '__main__':
    records = [
        {"id": f"DB-REC-{i}", "text": "Google was founded in September 1998 by Larry Page and Sergey Brin."}
        for i in range(1000)
    ]

    start_time = time.time()

    # Format input as a list of (text, context) tuples
    stream_input = [(rec["text"], rec["id"]) for rec in records]

    # Stream batches and use all available CPU cores with n_process=-1
    extracted_data_pipe = []
    docs_stream = nlp.pipe(stream_input, as_tuples=True, batch_size=256, n_process=-1)

    for doc, rec_id in docs_stream:
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        extracted_data_pipe.append({
            "id": rec_id,
            "entities": entities
        })

    duration_pipe = time.time() - start_time

    print(f"nlp.pipe processed 1,000 docs in: {duration_pipe:.4f} seconds")
    # duration_seq is the timing measured by the sequential loop in the previous snippet
    print(f"Speedup: {duration_seq / duration_pipe:.2f}x faster!")

Output:

nlp.pipe processed 1,000 docs in: 7.1310 seconds

In the optimized code snippet, we reorganize the input dataset into a sequence of tuples: (text_string, metadata_context). When calling nlp.pipe(stream_input, as_tuples=True, batch_size=256, n_process=-1):

batch_size=256 tells spaCy to buffer and process texts in groups of 256, reducing per-document Python overhead
n_process=-1 tells spaCy to detect your system’s CPU count and parallelize processing across all available cores
as_tuples=True instructs spaCy to yield (doc, context) pairs, ensuring that metadata (record IDs) stays aligned with the processed documents without manual index arrays or list-alignment code
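The context passed alongside each text does not have to be a single ID. As a minimal sketch, the example below carries a small dictionary per record through nlp.pipe and feeds the input from a generator. It assumes a blank English pipeline with a one-pattern entity ruler in place of the pre-trained model, and the "category" field is a hypothetical piece of metadata added for illustration.

```python
import spacy

# Assumption: a blank pipeline plus a one-pattern entity ruler replaces the
# pre-trained model; the "category" field is hypothetical metadata.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Google"}])

records = [
    {"id": "DB-REC-0", "category": "history", "text": "Google was founded in 1998."},
    {"id": "DB-REC-1", "category": "misc", "text": "Nothing to extract here."},
]

# A generator keeps the input lazy: tuples are built only as nlp.pipe consumes them
stream = (
    (rec["text"], {"id": rec["id"], "category": rec["category"]})
    for rec in records
)

rows = []
for doc, context in nlp.pipe(stream, as_tuples=True, batch_size=2):
    # The context dict comes back untouched, paired with its own doc
    rows.append({**context, "entities": [(ent.text, ent.label_) for ent in doc.ents]})

print(rows)
```

Because each doc arrives with its own context, the output rows can be written straight back to storage without any positional bookkeeping.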

The astute reader will notice that the processing time for the parallel version has actually increased compared to the sequential loop. This is due to the fixed overhead of starting worker processes and moving data between them; the savings become apparent as the number of documents grows.

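Re-running the same code excerpts with 10,000 records instead of 1,000 gives the following results (the “1,000” in the printed labels is hard-coded in the f-strings; these runs processed 10,000 documents):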

Sequential loop processed 1,000 docs in: 27.6733 seconds
nlp.pipe processed 1,000 docs in: 11.5444 seconds

At this scale the parallel version is about 2.4x faster, and the savings keep compounding as the corpus grows.
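For a rough sense of where the parallel version starts to pay off, the four timings reported above can be turned into a back-of-the-envelope estimate. This is only a sketch under a strong assumption: that runtime grows linearly with the number of documents, fitted from just two measurements per approach on one machine. The helper name linear_fit is made up for this illustration.

```python
# Timings reported above (seconds), keyed by number of documents
seq = {1_000: 2.7375, 10_000: 27.6733}
par = {1_000: 7.1310, 10_000: 11.5444}

def linear_fit(timings):
    """Return (fixed_cost, per_doc_cost) for the line through two measurements."""
    (n1, t1), (n2, t2) = sorted(timings.items())
    per_doc = (t2 - t1) / (n2 - n1)
    return t1 - per_doc * n1, per_doc

seq_fixed, seq_per_doc = linear_fit(seq)
par_fixed, par_per_doc = linear_fit(par)

# Corpus size at which both approaches would take the same time
# (assumes the linear model holds between and beyond the measured points)
break_even = (par_fixed - seq_fixed) / (seq_per_doc - par_per_doc)

print(f"Estimated parallel startup cost: {par_fixed:.1f} s")
print(f"Estimated break-even corpus size: {break_even:,.0f} documents")
```

Under these assumptions the break-even point lands at a few thousand documents, which matches the observation that 1,000 documents were slower in parallel while 10,000 were faster. The exact figure will differ with hardware, core count, and document length.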

# 3. Hybrid Named Entity Recognition with `EntityRuler`

Pre-trained statistical and transformer-based NER models are incredibly powerful for recognizing common entity types such as ORG, PERSON, or DATE based on context. However, they often fail to recognize domain-specific terms (such as custom product SKUs, legacy code IDs, or highly specialized medical terms) because they were not exposed to them during training.

Fine-tuning a deep learning statistical model on custom entities is one solution, but it typically requires labeling thousands of sentences and runs the risk of “catastrophic forgetting”, in which the model loses its ability to recognize standard entities along the way.

A clean, highly efficient alternative is a hybrid NER approach using spaCy’s EntityRuler. The EntityRuler allows you to define patterns (exact phrase strings or token-based pattern dictionaries, which can include regular expressions) and inject them directly into your pipeline. You can add it before the statistical NER, to pre-tag deterministic entities and help the model make context decisions, or after it, acting as a fallback or override.
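The fallback versus override distinction for a ruler placed after NER can be sketched without a trained model. In the example below, a first entity ruler named stand_in_ner imitates a statistical model that only knows "Apple" as an ORG (this is an assumption made so the sketch runs on a blank pipeline), and a second ruler named domain_ruler runs after it. Both component names and the build_pipeline helper are hypothetical; the overwrite_ents setting is the EntityRuler option that decides which annotation wins when spans overlap.

```python
import spacy

def build_pipeline(overwrite: bool):
    """Blank pipeline where a first ruler stands in for statistical NER."""
    nlp = spacy.blank("en")
    # Assumption: this ruler imitates a model that only knows "Apple" as ORG
    stand_in = nlp.add_pipe("entity_ruler", name="stand_in_ner")
    stand_in.add_patterns([{"label": "ORG", "pattern": "Apple"}])
    # The domain ruler runs afterwards; overwrite_ents decides who wins on overlap
    domain = nlp.add_pipe(
        "entity_ruler", name="domain_ruler", config={"overwrite_ents": overwrite}
    )
    domain.add_patterns([{"label": "PRODUCT", "pattern": "Apple Watch"}])
    return nlp

text = "I bought an Apple Watch yesterday."
fallback = [(e.text, e.label_) for e in build_pipeline(False)(text).ents]
override = [(e.text, e.label_) for e in build_pipeline(True)(text).ents]

print("fallback:", fallback)
print("override:", override)
```

With overwrite_ents left at False the later ruler only fills gaps and the earlier ORG annotation survives; with overwrite_ents set to True the rule-based PRODUCT span replaces it.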

Developers often try to patch gaps in statistical NER by running regexes on the text after running the spaCy pipeline, which leads to manual character-offset math and disconnected data structures:

import spacy
import re

nlp = spacy.load("en_core_web_sm")
text = "Please review system ticket ID: TKT-98421 on our corporate portal."

doc = nlp(text)

# Standard statistical NER misses custom ticket IDs
entities = [(ent.text, ent.label_) for ent in doc.ents]
print("Before post-process:", entities)

# Post-process regex patch
ticket_pattern = r"TKT-\d+"
matches = re.finditer(ticket_pattern, text)
custom_ents = []
for match in matches:
    # Requires complex char-to-token offset conversion to build spans
    custom_ents.append((match.group(), "TICKET_ID"))

# We now have two disconnected lists of entities that must be merged manually
print("Regex entities:", custom_ents)

Output:

Before post-process: []
Regex entities: [('TKT-98421', 'TICKET_ID')]

By adding an EntityRuler component directly to the pipeline, we merge rule-based regex patterns and statistical predictions into a single, unified doc.ents output:

import spacy

nlp = spacy.load("en_core_web_sm")

# Add the entity_ruler component before ner so it pre-tags entities (adding it after ner also works)
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define token-level patterns, including regular expressions
patterns = [
    # Match tokens consisting of "TKT-" followed by digits
    {"label": "TICKET_ID", "pattern": [{"TEXT": {"REGEX": r"^TKT-\d+$"}}]},
    # Match specific domain phrases exactly
    {"label": "ORG", "pattern": "corporate portal"}
]
ruler.add_patterns(patterns)

text = "Please review system ticket ID: TKT-98421 on our corporate portal."
doc = nlp(text)

# Both statistical and rule-based entities are consolidated inside doc.ents
for ent in doc.ents:
    print(f"Entity: {ent.text:<20} | Label: {ent.label_}")

Output:

Entity: TKT-98421            | Label: TICKET_ID
Entity: corporate portal     | Label: ORG

In this hybrid implementation, we call nlp.add_pipe("entity_ruler", before="ner"), and the EntityRuler runs as a regular pipeline component. When text is processed:

The tokenizer splits the sentence into tokens.
The EntityRuler runs first, identifies tokens that match our ticket regex pattern or exact phrase string, and tags them as TICKET_ID or ORG.
The statistical ner component runs next. Because it sees that these tokens are already tagged as entities, it respects the tags and makes its own predictions around them, avoiding collisions.

This ensures that all entities, both learned statistical ones and deterministic rule-based ones, coexist in a single, unified doc.ents sequence, eliminating the need for brittle post-process merging or offset adjustments.
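The point about offsets can be checked directly: entities produced by the ruler are ordinary spans, so their character positions are available without any manual conversion. The sketch below assumes a blank English pipeline, since only the ruler is exercised, and reuses the ticket pattern and sentence from the example above.

```python
import spacy

# Assumption: a blank pipeline is enough here, since only the ruler is exercised
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "TICKET_ID", "pattern": [{"TEXT": {"REGEX": r"^TKT-\d+$"}}]}]
)

text = "Please review system ticket ID: TKT-98421 on our corporate portal."
doc = nlp(text)

found = [(ent.text, ent.label_) for ent in doc.ents]
for ent in doc.ents:
    # start_char and end_char index into the original string directly
    assert text[ent.start_char:ent.end_char] == ent.text
    print(ent.label_, ent.text, ent.start_char, ent.end_char)
```

The same start_char and end_char attributes exist on entities predicted by the statistical ner component, so downstream code can treat both kinds of entity uniformly.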

# Wrapping Up

Optimizing spaCy is about moving from default configurations to pipelines that respect your system resources and domain-specific requirements.

By adopting these three tricks, you can design a highly efficient, production-grade text processing pipeline:

Selective loading and component disabling eliminate unnecessary computation and speed up processing (about 1.6x in the benchmark above).
Batch processing with nlp.pipe parallelizes execution across CPU cores, and setting as_tuples=True propagates important metadata without index-mapping bugs.
Hybrid NER with EntityRuler blends deterministic pattern-matching rules with general statistical inference, improving extraction accuracy for custom domains without retraining.

Deploying these design patterns ensures that your NLP pipeline remains scalable, memory-efficient, and tailored to the unique vocabulary of your business data.

Matthew Mayo (@mattmayo13) holds a master’s degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets and Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
