Why vector search often fails engineers, and how to fix it

November 30, 2025

The promise and the trap of vector search

Vector search, often referred to simply as "semantic search", is everywhere. I wrote about it too, here. The promise: feed it a large corpus (code, documents, product descriptions, reviews), ask a question, and get back relevant answers. This often works very well, and not just for text: it also works for images, speech, and video (you can even combine the modalities).

But in engineering, vector search can be problematic. That’s because there, tiny textual differences can hide big semantic or logical differences: a threshold, a "not", a performance parameter, a safety condition, a verb flip ("engage" vs. "disengage"). Vector search smooths over all of that; it treats...

"shall engage autopilot when lateral deviation < 1 dot"

...and...

"shall disengage autopilot when lateral deviation < 1 dot"

...as near-neighbors.

That’s a bug.

In this post I explain how this happens. Then I propose a more robust alternative: a hybrid symbolic + embedding + structured-RAG approach, tailored to engineering semantics.

GIVE ME STRUCTURE AND I WILL TELL YOU WHAT YOUR ENGINEERING TEXTS REALLY SAY

How vector search fails

Consider the examples above. To a vector search engine, they are nearly identical: same nouns ("autopilot", "lateral deviation"), same threshold ("< 1 dot"), same general structure. Yet semantically they are opposites: one says "turn the autopilot on", the other says "turn it off". A system that relies purely on vector similarity will cluster them together, and any downstream step may miss the conflict entirely.

Here’s another subtle case:

  • "Compute ETA so that error at destination is less than 30 seconds."
  • "Compute ETA so that error at destination is less than 300 seconds."

Again, near-identical to a vector search engine. Functionally, drastically different performance requirements.

Vector search collapses these differences because it smooths text into a continuous semantic space. That washes out exactly the discrete pivots that matter in engineering: thresholds, negation, and situational context.
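
You can see this directly in a few lines. A minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (the model choice is illustrative; other embedding models behave much the same):

    # Minimal demo of the failure mode; model choice is illustrative.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    reqs = [
        "shall engage autopilot when lateral deviation < 1 dot",
        "shall disengage autopilot when lateral deviation < 1 dot",
    ]
    emb = model.encode(reqs)
    print(util.cos_sim(emb[0], emb[1]))  # expect a very high similarity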

What to do instead: symbolic semantics, not fluid similarity

Engineering requirements are not essays. They are structured statements about:

  • triggers such as conditions, modes, states, and thresholds
  • actors and controlled objects such as subsystems, components, and interfaces
  • actions and effects such as enable, disable, compute, limit, signal, store, and reset
  • quantitative constraints such as timing, accuracy, tolerances, resource limits, and ranges
  • context such as operating modes, environmental conditions, and safety or reliability assumptions

These elements are categorical and discrete. Many are binary or numeric. Treating them as if they lived in a smooth similarity space is hazardous, because small changes often reverse meaning or shift the requirement into a different operating regime.

We need a representation that captures these structures explicitly rather than hoping vector similarity will keep them apart.

A hybrid solution: symbolic extraction plus vector search plus structured RAG

Here is a procedure for a more reliable search and analysis stack for engineering content. It's mostly based on this paper from AI21.

1. Extract a symbolic signature for each text

For each item in the corpus, parse or heuristically extract:

  • actors or components involved
  • trigger conditions such as variables, states, thresholds, or operating situations
  • actions or effects such as enable, disable, compute, limit
  • numeric constraints such as values, units, tolerances, or limits
  • context flags such as operating mode, environmental condition, or system state
  • interface or signal terminology, safety hints, and any other metadata that captures structured meaning

You can use LLMs for this, but you don't necessarily have to. For large corpora, where LLMs might be prohibitively slow or expensive, you can get pretty far with a mix of parsing (e.g. spaCy), regex, and small domain dictionaries.
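
To make this concrete, here is a minimal regex-and-dictionary sketch. The Signature fields, the tiny ACTIONS dictionary, and the threshold pattern are illustrative assumptions, not a fixed schema:

    # Sketch of heuristic signature extraction; fields and patterns are
    # illustrative, not a fixed schema.
    import re
    from dataclasses import dataclass, field

    ACTIONS = {  # small domain dictionary: verb -> (verb family, polarity)
        "engage": ("engage", "+"), "disengage": ("engage", "-"),
        "enable": ("enable", "+"), "disable": ("enable", "-"),
        "compute": ("compute", "+"), "limit": ("limit", "+"),
    }

    THRESHOLD_RE = re.compile(r"(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(\w+)")

    @dataclass
    class Signature:
        actions: list = field(default_factory=list)     # (verb family, polarity)
        thresholds: list = field(default_factory=list)  # (operator, value, unit)
        negated: bool = False

    def extract_signature(text: str) -> Signature:
        sig = Signature()
        for tok in re.findall(r"[a-z]+", text.lower()):
            if tok in ACTIONS:
                sig.actions.append(ACTIONS[tok])
            if tok in ("not", "never"):
                sig.negated = True
        for op, value, unit in THRESHOLD_RE.findall(text):
            sig.thresholds.append((op, float(value), unit))
        return sig

    a = extract_signature("shall engage autopilot when lateral deviation < 1 dot")
    b = extract_signature("shall disengage autopilot when lateral deviation < 1 dot")
    print(a.actions, b.actions)  # [('engage', '+')] vs. [('engage', '-')]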

2. Represent each item as a signature plus an embedding

An embedding is a numerical representation of text that places sentences or passages in a high-dimensional space where similar items end up close together. It's the representation vector search uses to determine how similar two pieces of text are. I wrote about this here.

For each text, retain both representations:

  • a structured symbolic signature
  • a vector embedding for broad topical similarity and fuzzy matching
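
Continuing the extraction sketch above, each corpus item can simply carry both views (the embedding call assumes the sentence-transformers model from the earlier demo):

    # One record per corpus item: symbolic signature plus embedding.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Item:
        text: str
        signature: Signature   # symbolic view, via extract_signature() above
        embedding: np.ndarray  # fuzzy topical view, e.g. model.encode(text)

    def make_item(text: str) -> Item:
        return Item(text, extract_signature(text), model.encode(text))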

3. Use a hybrid similarity or distance metric

When clustering or retrieving, combine two kinds of signals:

  • symbolic similarity, for example whether triggers match, thresholds are close, polarity is the same, context is aligned, and actions are compatible or opposite
  • embedding similarity, which captures whether two items live in the same general content area or subsystem

Symbolic similarity should dominate. This prevents two texts that differ by a single threshold, operator, or action verb from being treated as equivalent.
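
One way to encode "symbolic dominates" is to give the symbolic score veto power and let the embedding score only refine the ranking. A sketch that continues the types above; the weights, vetoes, and penalty rules are illustrative:

    # Hybrid scorer sketch: symbolic agreement dominates, embeddings refine.
    import numpy as np

    def embedding_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def symbolic_sim(s1: Signature, s2: Signature) -> float:
        pol1, pol2 = dict(s1.actions), dict(s2.actions)
        for verb in pol1.keys() & pol2.keys():
            if pol1[verb] != pol2[verb]:
                return 0.0  # engage vs. disengage: hard veto
        if s1.negated != s2.negated:
            return 0.0      # a flipped "not" is also a hard veto
        score = 1.0
        # Naively align thresholds by order (real code would match variables).
        for (op1, v1, u1), (op2, v2, u2) in zip(s1.thresholds, s2.thresholds):
            if op1 != op2 or u1 != u2:
                return 0.0
            hi = max(v1, v2)
            score *= (min(v1, v2) / hi) if hi else 1.0  # 30 vs. 300 -> 0.1
        return score

    def hybrid_sim(i1: Item, i2: Item, w_sym: float = 0.8) -> float:
        return (w_sym * symbolic_sim(i1.signature, i2.signature)
                + (1 - w_sym) * embedding_sim(i1.embedding, i2.embedding))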

You can also use the symbolic signatures for SQL or similar kinds of structured queries (cf. the AI21 paper). For example, you can query for all items where a variable exceeds a certain threshold, or where an action involves enabling or disabling a component, or where a specific operating mode appears, without relying on fuzzy text matching at all.
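
For instance, once the signatures live in a relational store, such queries become ordinary SQL. A sqlite3 sketch; the schema and column names are assumptions for illustration:

    # Structured queries over extracted signatures (illustrative schema).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE thresholds (req_id TEXT, variable TEXT, "
                "op TEXT, value REAL, unit TEXT)")
    con.executemany("INSERT INTO thresholds VALUES (?, ?, ?, ?, ?)", [
        ("R1", "lateral_deviation", "<", 1, "dot"),
        ("R2", "eta_error", "<", 30, "s"),
        ("R3", "eta_error", "<", 300, "s"),
    ])
    # All requirements that bound a variable below 60 seconds:
    rows = con.execute("SELECT req_id, variable, value FROM thresholds "
                       "WHERE op = '<' AND unit = 's' AND value < 60").fetchall()
    print(rows)  # [('R2', 'eta_error', 30.0)]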

4. Apply structured RAG for higher-level reasoning and synthesis

With symbolic signatures in place, you can:

  • group related items into clusters using the hybrid similarity metric
  • pick a few representative texts from each cluster and send them to a reasoning model, asking it to produce a structured summary of what that cluster seems to describe
  • express the result as a typed sketch in JSON or another structured format, capturing things like key actions, inputs, outputs, constraints, or relationships
  • use that sketch to navigate the corpus, spot conflicts, check for missing coverage, and link related material together

This way you get the precision that technical work depends on. At the same time, you can do retrieval, summarization, and higher-level reasoning over the structured view of your corpus.
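
As a sketch of the summarization step: pick representatives, ask the model for a typed JSON sketch, and parse it. Here call_llm is a hypothetical stand-in for whatever reasoning model you use, and the JSON keys are an illustrative schema:

    # Structured-RAG sketch: cluster -> typed JSON summary.
    import json
    from typing import Callable

    SKETCH_PROMPT = (
        "Here are representative requirements from one cluster. Return JSON "
        "with keys: topic, actors, actions, constraints, open_questions.\n\n"
    )

    def summarize_cluster(representatives: list[str],
                          call_llm: Callable[[str], str]) -> dict:
        # The typed sketch then drives navigation, conflict and coverage checks.
        return json.loads(call_llm(SKETCH_PROMPT + "\n".join(representatives)))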

Why this matters for real engineering work

  • It reduces false matches and missed matches that come from naive similarity, which is important when a small wording change can completely change the meaning of a technical statement.
  • It helps you spot conflicts, missing conditions, unclear logic, and inconsistent assumptions across large collections of technical material.
  • It gives you a structured, machine-readable view of the content, which makes tasks like linking related items, generating tests, or checking coverage far more reliable.
  • It avoids the trap of thinking that vector search alone understands technical meaning, and replaces it with a representation that captures the actual logic, thresholds, and relationships engineers work with every day.

Trade-offs and what you give up

All this isn’t a silver bullet. Some trade-offs you’ll face:

  • Higher upfront engineering effort to build parsers, dictionaries, and signature schemas.
  • No guarantee that you catch all corner cases: requirements language is messy, and natural-language parsing is brittle.
  • The approach works for text, but it doesn’t automatically extend to images, sound, or video.
  • You might miss subtle or context-dependent meaning. For those cases, you may still need more powerful and more expensive language models.

Still: for the bulk of engineering content, this hybrid + structured-RAG approach often works a lot better than vector search alone.