Diving Into Semantic Search

By Oscar Frith-Macdonald, 28 August 2024

In this article, we’ll take a deeper dive into semantic searches—what they are, how they work, and when to use them. I'll start by answering some fundamental questions:

  • What is semantic search?
  • What is embedding?
  • How does embedding enable semantic search?
  • When should we use semantic search?

Next, we will share our journey of implementing AI in a FileMaker solution, including the step-by-step process of setting up a semantic search. And finally, there is a demo file provided so you can play with semantic searches yourself and see how different models perform. 

So, What Is Semantic Search?

A semantic search is a search technique that aims to improve search accuracy by understanding the context of terms within a search. 

Unlike traditional keyword-based searches, which match exact words, semantic searches consider the intent behind the search and the contextual relationships between words. 

What Is Embedding?

Embedding is the process of taking your complex data and converting it into a vector, a list of numbers, where each number represents some “attribute” of the original complex data. In natural language processing, embeddings convert words, phrases, or sentences into vectors of real numbers, capturing semantic meanings and similarities.

[Image: embedding vectors for "cat" and "kitten"]

For a more detailed explanation of embedding, please refer back to our last AI article.

How Does Embedding Enable Semantic Search?

Embedding allows the semantic meaning of words or phrases to be given a set of numerical values. We can then compare these values to find data with a similar meaning, rather than an exact match of words or phrases.
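To make this concrete, FileMaker 2024 exposes this idea directly: the GetEmbedding function turns text into a vector, and the CosineSimilarity function compares two vectors. The sketch below assumes an account called "OpenAI-Account" (a name we've made up) and OpenAI's text-embedding-3-small model, and the GetEmbedding parameter order is our assumption, so check the FileMaker help for the exact signature.

    Set Variable [ $cat ; Value: GetEmbedding ( "OpenAI-Account" ; "text-embedding-3-small" ; "cat" ) ]
    Set Variable [ $kitten ; Value: GetEmbedding ( "OpenAI-Account" ; "text-embedding-3-small" ; "kitten" ) ]
    Set Variable [ $invoice ; Value: GetEmbedding ( "OpenAI-Account" ; "text-embedding-3-small" ; "invoice" ) ]
    # CosineSimilarity returns a number from -1 to 1; "cat" vs "kitten" should score far higher than "cat" vs "invoice"
    Set Variable [ $simKitten ; Value: CosineSimilarity ( $cat ; $kitten ) ]
    Set Variable [ $simInvoice ; Value: CosineSimilarity ( $cat ; $invoice ) ]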

When Should We Use Semantic Search?

The first thing you will need to consider is the type of data you want to work with. It's likely that a lot of the data in your FileMaker solution is actually too structured for semantic searches. For example, there is little point in doing a semantic search on invoice totals; this is very clear and structured data and best suited to traditional search methods. 

Semantic searches are much better used for searching blocks of text for meaning and context. So if you have all of your client meeting notes in a database, this would be far better suited to a semantic search.

Traditional Keyword Search:

  • Query: "project deadlines"
  • Results: Returns notes that explicitly contain the exact phrase "project deadlines."

Semantic Search:

  • Query: "project deadlines"
  • Results: Returns notes that discuss topics related to project deadlines, even if they use different phrases or terminology. For instance:
  1. "We need to finalise the timeline for project completion."
  2. "The client is concerned about the due dates for deliverables."
  3. "Ensure the milestones are met within the agreed schedule."
  4. "Deadlines for the project phases were reviewed and adjusted."

Embedding Your Data and Using Semantic Search

The first step in using AI is to configure your AI account. This is done with the Configure AI Account script step. This allows you to either select OpenAI or set a custom endpoint. You will likely want to incorporate this script step into your startup script to ensure that an AI session is started for a user when they open the system.

[Screenshot: Configure AI Account script step]
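As a rough sketch, the startup script might include something like this (the account name is one we've made up, the parameter labels are approximate, and you should store your API key securely rather than hard-coding it):

    # Run at startup so an AI session exists for the rest of the user's session
    Configure AI Account [ Account name: "OpenAI-Account" ; Model provider: OpenAI ; API key: $apiKey ]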

Next, you will want to embed your data. As discussed, a lot of your data will not be suitable for a semantic search, and it's probable that even the data that is suitable could be improved. Given the previous example of client meeting notes, we can improve the context of those notes very simply by creating a single field containing all the data you want to embed. The easiest way we found to do this is to create a calculation field that lists everything you want to embed: in this case, the client name and a brief description of what the notes are.

[Screenshot: calculation field combining the client name and meeting notes]
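The calculation for that field might look something like this (table and field names are hypothetical):

    // TextToEmbed (calculation, text result): everything we want the embedding to capture
    List (
        "Client: " & Clients::ClientName ;
        "Meeting summary: " & ClientMeetingNotes::Summary ;
        ClientMeetingNotes::Notes
    )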

Selecting a Model 

You may need to be careful with the model you use when embedding your data. Models have limits on the number of tokens sent per minute (TPM), the number of requests per day (RPD), and the number of requests per minute (RPM). Unfortunately, it's not really possible to say how big a token is; a token could be a whole word or just a single character, depending on how frequently the model has seen that word. More common words are likely to be tokens in themselves, whereas less common words could be split into several tokens, in the extreme one per character. OpenAI does offer some general rule-of-thumb calculations to make a good estimate of how many tokens a block of text is likely to have; these can be found here. Below are OpenAI’s free limits, though these can be increased if you pay. So if your meeting notes are pretty long and you don't want the end of your notes to be cut off, you may need to pay for higher limits, or another option may be to chunk your data and have smaller embeddings.
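OpenAI's usual rule of thumb is that one token is roughly four characters of English text, so a simple calculation gives a ballpark estimate of your token usage before you hit a limit (field name hypothetical):

    // Rough token estimate: ~4 characters per token for typical English text
    Ceiling ( Length ( ClientMeetingNotes::TextToEmbed ) / 4 )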

One thing to be aware of is that a request is made up of tokens, and with the limits, the token is king. For a very extreme example, say you are using the gpt-4o model and you have a request that is 30,000 tokens. While you technically have 500 requests per minute, you have used up all your tokens for that minute in one go, so you now can't make any more requests. This is a very unlikely scenario, but definitely something to keep in mind.

[Image: OpenAI free-tier rate limits for the text embedding models]

Another consideration when selecting a model could be the size of your database. Different models will embed your data differently; some may create a vector of 500 dimensions, others could create a vector with 2000+ dimensions, and the more dimensions, the larger the size. 

When embedding data, the number of dimensions is constant, regardless of the data you are embedding. 

Single Word Input: "cat"

  • Vector Size: 1,536 dimensions (if the model's output is 1,536-dimensional)

Paragraph Input: "The cat sat on the mat."

  • Vector Size: 1,536 dimensions

Usually you will be embedding large blocks of text, so the embedding is likely to be smaller than the source text, but if you are embedding small bits of data with a large model, you may actually create an embedding that is larger than the original data. In FileMaker, it is also important to embed your data into a container field to help reduce the size. You can embed directly into a text field, but in that case every character takes 2 bytes, and when one dimension is something like 0.003164375, that adds up: those 11 characters take 22 bytes, or 176 bits. If you store the embedding in a container instead, each dimension is saved as a binary number (0.0000000011001111011) and would take just 20 bits. Another good reason to embed into a container is that container fields can't be indexed, so if, for some reason, you do embed into a text field, you will need to make sure to turn off indexing for that field, or your file size could grow very quickly.
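Scaling the article's example figures up to a 1,536-dimension vector shows how quickly the difference adds up:

    As text:   1,536 dimensions × 22 bytes each ≈ 33 KB per record
    As binary: 1,536 dimensions × 20 bits each  ≈ 3.8 KB per record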

OpenAI offers more generic models, but you will likely get better results by using a model trained on your data. For example, a generic model working with data about rocks is likely to group all the different rock types together because they are all just rocks, whereas a model trained on data about rocks will do a far better job of distinguishing between different types of rock. Think of it like asking your average Joe on the street about your data vs. asking a specialist in the field.

Embedding your data

Before we get onto embedding your data, there are a few key things to remember and consider. To do a semantic search, you will need to embed your question and then compare that embedding to your embedded data to find the closest match. The important thing to remember here is that your question must be embedded with the same model that your data was embedded with. If that doesn't make much sense yet, don't worry: it will be explained in more detail further on. For now, just remember to use the same model.

Once you have selected your model, you will need to embed your data. We recommend at least keeping your embedded data in a separate table and potentially even a separate file. This will allow you to easily store and manage different embeddings. (Yes, I know we said you should use the same model for everything, but when getting started, you may want to test different models and see which works best for you.)

Another size issue to consider: as mentioned above, FileMaker allows you to save your embedded data into a text field, but the recommended method is to use a container field. A container field will be smaller than a text field, and FileMaker can search the container faster.

[Screenshot: the embedded client notes table]

The easiest way to embed your data is by using a found set, so first find your data. Once you have the data you want to embed, simply use the Insert Embedding in Found Set script step and select the field in your base table as the source and the container field in your embedded table as the target. This can be quite time-consuming, so it is probably best done server-side.

[Screenshot: Insert Embedding in Found Set script step]
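A minimal embedding script, suitable for running server-side, might look like this (parameter labels approximate; account, model, and field names as assumed earlier):

    # Find the records to embed, then embed the whole found set in one step
    Perform Find [ Restore ]
    Insert Embedding in Found Set [ Account name: "OpenAI-Account" ; Embedding model: "text-embedding-3-small" ; Source field: ClientMeetingNotes::TextToEmbed ; Target field: ClientNotes_Embedded::ClientNotesEmbedded_Small ]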

A Note on Embedding Your Data

As mentioned in the section on selecting your model, the number of dimensions remains constant, whether you embed a single word or an entire paragraph. This is an important consideration when embedding your data, as you may want to think about breaking down your data into smaller chunks before embedding. When performing a semantic search, the results reflect the context or meaning of the embedded content. For instance, if your meeting notes cover various topics across different paragraphs, embedding them as a whole might generalise the context of the entire meeting, leading to a loss of specific details. By chunking your data into smaller, more focused segments, you can preserve a clearer context for each part.

The need to chunk your data can also depend on the model you are using and the data you want to embed. A larger model with more dimensions is likely to capture and retain the context of a lengthy text block better than a smaller model, which might lose some nuances. Therefore, when using a smaller model, you might want to invest more time in carefully chunking your data before embedding. Unfortunately, there’s no one-size-fits-all approach here; it largely depends on the nature of your data, your budget, and the time you’re willing to spend on the process.
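If you do decide to chunk, one crude approach is a looping script that creates one chunk record per return-delimited paragraph of the notes (table and field names hypothetical; real chunking may want to group related paragraphs instead):

    Set Variable [ $notes ; Value: ClientMeetingNotes::Notes ]
    Set Variable [ $noteID ; Value: ClientMeetingNotes::ID ]
    Go to Layout [ "NoteChunks" ]
    Set Variable [ $i ; Value: 0 ]
    Loop
        Set Variable [ $i ; Value: $i + 1 ]
        Exit Loop If [ $i > ValueCount ( $notes ) ]
        # One new chunk record per paragraph, linked back to the original note
        New Record/Request
        Set Field [ NoteChunks::ChunkText ; GetValue ( $notes ; $i ) ]
        Set Field [ NoteChunks::NoteID ; $noteID ]
        Commit Records/Requests
    End Loop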

Adding a Semantic Search

The next step is to add a semantic search. To start with, we just added a text field for the search input. We then used the Perform Semantic Find script step to both embed the search entry and compare it to the embedded client notes. This script step also lets you choose how many results are returned and set a minimum cosine similarity: a cosine similarity of 1 means identical data, and a cosine similarity of -1 means opposite data.

[Screenshot: Perform Semantic Find script step]
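As a sketch, the step might be configured along these lines (parameter labels approximate; the global search field name is hypothetical):

    Perform Semantic Find [ Query by: Natural language ; Account name: "OpenAI-Account" ; Text: SearchGlobals::SearchInput ; Target field: ClientNotes_Embedded::ClientNotesEmbedded_Small ; Return count: 10 ; Cosine similarity: 0.5 ]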

This did produce reasonably good results, but due to how embedding data and using cosine similarity work, we found that slightly modifying the search input did help to improve the results. A semantic search is able to determine meaning and context, but to an extent it is still measuring how similar your search input is to the stored text, without requiring an exact match. This means that if we can make our search look more like the embedded data, we will get better results.

The way we made the search entry more like the already embedded client notes was to add a pop-up entry for the client name and then convert the search input into a list before performing the semantic search. This should improve the semantic search, as the cosine similarity between the client name in the embedded data and the SearchClientName will be 1.

[Screenshot: example search for the client Pioneer Solutions]

[Screenshot: client name pop-up added to the search]
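The reshaped query simply mirrors the calculation used when embedding (field names hypothetical):

    // Search input restructured to match the shape of the embedded data
    List (
        "Client: " & SearchGlobals::SearchClientName ;
        SearchGlobals::SearchInput
    )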

A note on where to perform your searches

As we suggested, your embedded data should be in a separate table or file, so when performing your search, go to the table the embedded data is in rather than trying to search through a relationship. Better still, perform your search server-side and return just the relevant records. Otherwise, you are searching all the embedded data locally, and those embeddings can be a lot of data to transfer to the client.

This is also an argument for trying to perform some standard FileMaker searches to reduce your found set before performing a semantic search. This will greatly increase the performance of the semantic search.
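In script terms, that two-stage approach might look like this (parameter labels approximate, including the option to limit the search to the current found set):

    # Narrow the found set with a cheap standard find first
    Perform Find [ Restore ]  # e.g. restrict to one client or a date range
    # Then run the semantic find against only the current found set
    Perform Semantic Find [ Query by: Natural language ; Text: SearchGlobals::SearchInput ; Target field: ClientNotes_Embedded::ClientNotesEmbedded_Small ; Search in: Current found set ; Return count: 10 ]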

Sorting your results

A nice feature of the Perform Semantic Find script step is that it returns your results sorted by cosine similarity, which is great if your embeddings are in the same table as the unembedded data. However, we suggested keeping your embedded data in a separate table, which likely means you will be using a Go To Related Record script step to display the found records, and you will lose the sort order. 

To get around this, we added an unstored calculation field on the Client Meeting Notes table that calculates the cosine similarity. This also requires embedding your search input into a global variable before doing the semantic search.

[Screenshot: cosine similarity calculation field]

 

CosineSimilarity (calculation, unstored) =
CosineSimilarity ( $$EmbeddedSearchInput ; ClientNotes_Embedded::ClientNotesEmbedded_Small )
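The global variable referenced above can be set just before the semantic find, for example (the GetEmbedding parameter order is our assumption, as before, and the model must be the same one used to embed the data):

    Set Variable [ $$EmbeddedSearchInput ; Value: GetEmbedding ( "OpenAI-Account" ; "text-embedding-3-small" ; SearchGlobals::SearchInput ) ]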

 

Once the Go To Related Record script step is complete, you can sort by the cosine similarity field. At least working locally, the performance on this is pretty good, as you are likely only returning a small number of records. However, if your solution is not local or you are returning lots of records, then you will likely see a hit to performance.

Improving your results

We found that the semantic search did a great job at finding relevant records but wondered if it could be improved. We sometimes found that the most relevant result for our search was not actually the most useful result. So one method to improve the search would be to make it a two-part search.

For example, you let the user do a semantic search with natural language and return the results for that. The user can select the result they think is best and use that result for another semantic search, this time using the embedded data for that record as the search input. The chances are this will contain a lot more information than the user's natural language search, so it may produce better results.

Another thing you may want to test is making that two-part search a behind-the-scenes search. So you take the top result from the natural language search, use that record for the second semantic search, and only present the results from the second search to the user. If you are doing this, then you want to be very sure the first result is a good result; otherwise, you might be using a bad result to search for similar bad results. One way to mitigate this would be to only perform the second search on the record if the cosine similarity is greater than 0.8 (or whatever value you find works best for you).
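A sketch of that behind-the-scenes two-pass search (parameter labels approximate; tune the 0.8 threshold to your own data):

    # Pass 1: natural-language search, keeping only the top hit
    Perform Semantic Find [ Query by: Natural language ; Text: SearchGlobals::SearchInput ; Target field: ClientNotes_Embedded::ClientNotesEmbedded_Small ; Return count: 1 ]
    # Only trust the top hit if it is similar enough to the original query
    If [ CosineSimilarity ( $$EmbeddedSearchInput ; ClientNotes_Embedded::ClientNotesEmbedded_Small ) ≥ 0.8 ]
        # Pass 2: search again using the top record's own embedding as the query
        Perform Semantic Find [ Query by: Vector data ; Vector: ClientNotes_Embedded::ClientNotesEmbedded_Small ; Target field: ClientNotes_Embedded::ClientNotesEmbedded_Small ; Return count: 10 ]
    End If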

Finally, you may want to do an initial standard FileMaker search to get a reduced found set, then perform your semantic search on that found set. 

Conclusions

Semantic searches can be really helpful and powerful tools, but only when the data suits that kind of search. So going forward with FileMaker solutions, you will want to be aware of what data might suit a semantic search and then make sure that data is in a good structure for one.

Ultimately, it's likely you will have to do quite a bit of testing around the semantic search to find the best option for your data. But once set up, you will have a very powerful search built into your solution.

Demo

Now wouldn't you like to see firsthand how different models and query formatting can affect your semantic search? Well, fear not, we have a demo file where you can have a play yourself, look under the hood and see the data structure, try and enter some really obscure prompts, and see if you can get a result you just wouldn't be able to with a standard FileMaker search. This demo file contains around 4000 book summary records with large and small embeddings for each book.

[Screenshot: search options in the demo file]

 

Some Interesting Stats About The Demo File

When it came to providing this demo, we came up against a couple of hurdles with the size of the file, so we dug into what exactly was causing this demo to be so much larger than our other demos (over 200 MB, in fact):

  • For starters, the demo admittedly has many more records than we normally include, as we wanted to provide a good selection of sample data for you to play with.

  • But the raw data is not what is causing the size issue; in fact, the raw book data, all the images and text in the demo file, scripts, etc. came to just 38 MB.

  • The bulk of the file size comes from the large embeddings at 100.6 MB, double the size of the small embeddings at 50.3 MB.

Again, this highlights the importance of testing and finding the best solution for you. If size is a consideration, you may want to opt for an embedding model with fewer dimensions.
