How Vector Embeddings Enable Semantic Searching in AI

May 3, 2023

Semantic Search 

Searching through an organization’s documents can feel like looking for a needle in a haystack. Except the haystack is the size of Mount Everest. 

The simplest text search method is “lexical.” In basic terms, the search engine tries to match the keywords in your query to keywords in the documents. This is fast and works well enough in most cases. Unfortunately, the English language is full of synonyms, homonyms, and ambiguity, so matching exact words often misses relevant documents.

What if a search engine could understand the contextual meaning and intent behind our query? This is where semantic search comes in. In this article we will be talking specifically about text embedding vector-based search. That’s quite the mouthful. Let’s see if we can break it down.

The Problem with Traditional Search 

To better understand how vector embeddings support data retrieval, it helps to first look at where traditional search falls short. Embeddings let machine learning algorithms capture the underlying relationships and patterns in the data, which traditional filters cannot do.

In a traditional database you can perform filters like this:

 

                 WHERE [ID] = 35356323

                 WHERE [SALES] > 200000 AND [REGION] != 'NORTHWEST'

                 WHERE [SALES_DATE] > '2023-01-01'

These kinds of queries work well enough for structured data.  But how do you search unstructured data like text or images? For example, how would you index text like this:

 

The Maine Coon is a large, domesticated cat breed. It is one of the oldest natural breeds in North America.

The breed originated in the U.S. state of Maine, where it is the official state cat.

The Maine Coon is a large and social cat, which could be the reason why it has a reputation of being referred to as "the gentle giant."

The Maine Coon is predominantly known for its size and dense coat of fur which helps the large feline to survive in the harsh climate of Maine.

The Maine Coon is often cited as having "dog-like" characteristics.

A Maine Coon for reference - in case you also had no idea what it was :)

This is a common problem and one that search engines have been dealing with for decades.  The simplest approaches involve keywords.  You search for ‘cat’ and return all documents with the word ‘cat’ in them. 

What if you search for ‘big, long-haired kitty’? Clearly this article would be relevant to that search, but none of the search words are found in the article. You could use synonym lookups, but those are difficult to maintain and error-prone. Semantic search goes beyond simple keywords and strives to include the intent and contextual meaning.
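The keyword problem can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the `tokenize` and `keyword_search` helpers and the "Maine Coon article" label are made up for this example, and the document text is abridged from the passage above.

```python
import re

# Toy document library. The label and the abridged text are assumptions
# made for this illustration.
documents = {
    "Maine Coon article": (
        "The Maine Coon is a large, domesticated cat breed. "
        "It is predominantly known for its size and dense coat of fur."
    ),
}

def tokenize(text):
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(query, docs):
    """Return the names of documents sharing at least one word with the query."""
    query_words = tokenize(query)
    return [name for name, text in docs.items() if query_words & tokenize(text)]

print(keyword_search("cat", documents))                     # ['Maine Coon article']
print(keyword_search("big, long-haired kitty", documents))  # []
```

The second query returns nothing even though the article is clearly about a big, long-haired cat, because "big", "long", "haired", and "kitty" never appear in the text.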

How Vector Embeddings Can Help 

Vector embedding works by analyzing enormous amounts of text data to identify patterns and relationships between words. We can then use this analysis to convert words and text into an array of numbers called a vector. These vectors encode the meaning of the text and are much easier for computers to work with. Let’s look at a simplified example.

In our simple embedding we can see that words that are like each other are grouped closely. Words that are not similar are far apart. We can measure how close two words are with simple formulas such as Euclidean distance or cosine similarity. The distance between two words can be interpreted as how semantically close they are.

Embeddings can be thought of as coordinates in an abstract semantic space. In our simplified example we are just using X, Y coordinates.
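To make the idea of measuring closeness concrete, here is a small sketch of two common measures on X, Y coordinates. The word coordinates below are invented for illustration and do not come from any real embedding model.

```python
import math

# Hypothetical 2-D coordinates, invented for this example only.
words = {
    "cat": (9.0, 8.0),
    "kitten": (8.0, 9.0),
    "invoice": (-7.0, 3.0),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between two points: 0 means identical."""
    return math.dist(a, b)

for other in ("kitten", "invoice"):
    sim = cosine_similarity(words["cat"], words[other])
    dist = euclidean_distance(words["cat"], words[other])
    print(f"cat vs {other}: cosine similarity {sim:.3f}, distance {dist:.3f}")
```

With these made-up coordinates, "cat" and "kitten" score a high cosine similarity and a small distance, while "cat" and "invoice" score the opposite. Note that cosine similarity compares direction only, whereas Euclidean distance also accounts for how far each point is from the origin.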

In real use, these embedding vectors would be much larger. OpenAI’s text-embedding-ada-002 model produces vectors with 1,536 dimensions. Our example uses single words, but we can also convert phrases, whole documents, images, and more to embeddings.

Let’s go back to our original example and assign each document in our library to an embedding vector.

 

Maine Coons.docx              = [12, 12]

2023 Financial Report.xlsx    = [5, 5]

Governance Plan.pdf           = [-6, -6]

 

When we run the search “big, long-haired kitty” through our model, we get an embedding of [12, 11]. It is easy to see that Maine Coons.docx is the closest match.
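The comparison can be checked with a short sketch that ranks the three document embeddings listed above by straight-line (Euclidean) distance from the query embedding. The `rank_by_distance` helper is a name chosen for this illustration.

```python
import math

# Document embeddings and query embedding taken from the example above.
document_embeddings = {
    "Maine Coons.docx": (12, 12),
    "2023 Financial Report.xlsx": (5, 5),
    "Governance Plan.pdf": (-6, -6),
}
query_embedding = (12, 11)  # "big, long-haired kitty"

def rank_by_distance(query, embeddings):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    scored = [(name, math.dist(query, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1])

for name, distance in rank_by_distance(query_embedding, document_embeddings):
    print(f"{name}: {distance:.2f}")
# Maine Coons.docx: 1.00
# 2023 Financial Report.xlsx: 9.22
# Governance Plan.pdf: 24.76
```

One caveat about this toy data: [12, 12] and [5, 5] point in exactly the same direction, so cosine similarity would score the first two documents identically. Straight-line distance is what separates them here.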

Vector Databases 

Searching is easy if you only have three documents, but what if we have millions or even billions of documents?  Checking the distance from the search embedding to each document embedding would be computationally expensive.  This is where vector databases come in. These databases have developed efficient ways of storing and searching vector embeddings.
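As a rough intuition for how a search can avoid comparing against every document, the sketch below groups vectors into coarse grid cells and only checks the cells around the query. This is a toy illustration under simplifying assumptions (2-D vectors, an arbitrary cell size), not how any particular vector database is implemented. Production systems generally rely on more sophisticated approximate nearest neighbor (ANN) indexes.

```python
import math
from collections import defaultdict

CELL_SIZE = 10  # width of each grid cell; an arbitrary choice for this toy

def cell_of(vector):
    """Map a 2-D vector to the integer grid cell that contains it."""
    return tuple(math.floor(component / CELL_SIZE) for component in vector)

def build_index(embeddings):
    """Group document names by grid cell so nearby vectors share a bucket."""
    index = defaultdict(list)
    for name, vector in embeddings.items():
        index[cell_of(vector)].append(name)
    return index

def search(query, embeddings, index):
    """Check only the query's cell and its 8 neighbours, then rank by distance."""
    cx, cy = cell_of(query)
    candidates = [
        name
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for name in index.get((cx + dx, cy + dy), [])
    ]
    return sorted(candidates, key=lambda name: math.dist(query, embeddings[name]))

embeddings = {
    "Maine Coons.docx": (12, 12),
    "2023 Financial Report.xlsx": (5, 5),
    "Governance Plan.pdf": (-6, -6),
}
index = build_index(embeddings)
print(search((12, 11), embeddings, index))
# ['Maine Coons.docx', '2023 Financial Report.xlsx']
```

Governance Plan.pdf sits in a distant cell, so its distance is never computed. The trade-off is that a bucketed search like this can miss a true nearest neighbor that falls outside the cells it checks, which is why such methods are called approximate.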

Vector databases have seen a surge in popularity because they pair well with large language models like ChatGPT. ChatGPT plug-ins combined with a vector database allow users to safely query their own data: relevant documents are retrieved at query time rather than being used to train the larger model.

Some examples of vector databases: Weaviate, Pinecone, and Zilliz.

Conclusion 

In conclusion, semantic search powered by vector embeddings offers a more sophisticated and accurate approach to information retrieval than traditional keyword-based methods. By capturing the contextual meaning and intent behind a query, semantic search can better understand unstructured data like text and images.

As technology continues to evolve, we can expect further advancements in semantic search, opening new possibilities for data analysis and insights.

 

Further Reading:

Vector Embeddings Explained

A Brief History of Word Embeddings

ChatGPT Retrieval Plugins

About the Author:

Bradley Nielsen

Senior Tech Specialist

Bradley is a well-rounded developer in the field of data science and analytics. He has been a developer and architect on a wide range of data initiatives in multiple industries. Bradley's primary specialty is in data engineering: developing, deploying, and supporting data pipelines for big data and data science. He is proficient in Python, C#, SQL Server, Apache Spark, Snowflake, Docker, and Azure.
