The science behind the Raffle-lution: Natural language search

What is NLP? Read more about natural language processing within Raffle and how the software understands contextual meaning from text.

By Raffle

This is the second in our “The science behind the Raffle-lution” series. Read the first here.

When we think about lightning-fast search capabilities today, we think about the masters: Google. Using their search engine, you can search the entire internet at the click of a button.

How did they achieve this? Search engines represent the text in a document (or web page) with an index, which in its simplest form is a list of all the words that occur there. A collection of such documents is represented with a document-term matrix, with one row per document and one column per word.

This matrix is usually sparse (it contains a lot of zeros) because each document contains only a small subset of all possible words. Only the non-zero entries need to be stored and scanned, which is why lookup is extremely fast.
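
One common way to exploit that sparsity is an inverted index, which stores only the non-zero entries of the document-term matrix: for each word, the set of documents that contain it. The sketch below is a toy illustration under that assumption, not a description of how Google's index is built; the documents and the function names (`tokenize`, `build_inverted_index`, `search`) are made up for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Search engines index web pages.",
]
index = build_inverted_index(docs)
print(search(index, "the cat"))  # documents 0 and 1 contain both words
```

A query only touches the entries for its own words, so the cost depends on how many documents contain those words rather than on the size of the whole vocabulary.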

Google: undisputed masters of internet search.

Traditional search has limitations

However, a traditional search index is like putting a document's words in a bag and then shaking it. The detailed meaning is gone.

We can partially make up for that by storing not only single words in the bag but also terms made up of two or more words, such as “New York.”

But this isn’t a complete solution, because the number of distinct terms we have to store grows rapidly. Google’s search index stores a lot of terms. How many is unknown, but you can get an impression of the scope with Google Trends.
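
To make the bag analogy concrete, here is a small sketch with two made-up sentences. It shows that single-word counts cannot tell them apart, that two-word terms can, and how quickly the number of possible two-word terms grows. The helper names `unigrams` and `bigrams` are illustrative only.

```python
from collections import Counter


def unigrams(text):
    """Single words, lowercased."""
    return text.lower().split()


def bigrams(text):
    """Pairs of adjacent words, lowercased."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"

# The bags of single words are identical, so the difference in meaning is lost.
same_bag = Counter(unigrams(a)) == Counter(unigrams(b))

# Two-word terms keep some word order, so the sentences become distinguishable.
same_bigrams = Counter(bigrams(a)) == Counter(bigrams(b))

# The price: a vocabulary of V words allows up to V * V two-word terms.
vocabulary = set(unigrams(a)) | set(unigrams(b))
max_bigrams = len(vocabulary) ** 2

print(same_bag, same_bigrams, len(vocabulary), max_bigrams)
```

With three distinct words there are already nine possible two-word terms; for a realistic vocabulary the same square law is what makes storing every multi-word term impractical.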

From sparse to dense search indices

As discussed in our last blog post, Raffle’s machine learning solutions for natural text search are built upon context-aware text representations.

So, where the traditional search approach is fundamentally limited, the machine learning approach can, with sufficient training data, learn to pick up subtle contextual cues that can make the difference in finding the correct answers.

Recently, Google has made a major improvement in workplace search results by using BERT in natural text searches. We see this as a part of an overall trend in search where we go from the traditional sparse search indices to learned dense representations of documents and queries.

A new way to answer questions

In recent work from both Google and Facebook AI Research, we see this approach applied to question answering at scale using fine-tuned BERT models. These follow a two-stage process:

Document retrieval. The document retriever first encodes the whole knowledge base (for example, the entirety of Wikipedia!) into a couple of million dense representations. The question is then encoded in the same way, and the excerpts are ranked against it with a similarity measure (inner product). The top 5 to 10 excerpts of the knowledge base are passed on to the answer generation stage.
Answer generation. The answer generator uses the excerpts from the knowledge base together with the question to generate an answer either by extracting pieces of text from the excerpts or by a generative language model that composes the answer.
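
The retrieval stage can be sketched in a few lines. This is a minimal illustration, not the Google, Facebook, or Raffle implementation: the three-dimensional vectors below are hand-made stand-ins for what a fine-tuned BERT encoder would produce, and the passages, the query, and the function names (`inner_product`, `retrieve`) are assumptions made for the example.

```python
def inner_product(u, v):
    """Similarity between two dense vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def retrieve(query_vec, passage_vecs, k=5):
    """Return the indices of the k passages most similar to the query."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: inner_product(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


passages = [
    "How to reset your password",
    "Opening hours of the store",
    "Changing your login credentials",
]

# Hypothetical embeddings; a real system would compute these with an encoder
# once, ahead of time, for the whole knowledge base.
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]

# Hypothetical embedding of the question "I forgot my password".
query_vec = [1.0, 0.2, 0.0]

top = retrieve(query_vec, passage_vecs, k=2)
print([passages[i] for i in top])
```

In this toy setup the third passage ranks second even though it shares no keyword with the question, which is the kind of match a sparse word index would miss. At the scale of millions of passages, the exhaustive sort would normally be replaced by an approximate nearest-neighbor search.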

Raffle’s products use a similar document retrieval model. Each retrieved document is an answer by itself, so there is no need for an answer generation module. Our “secret sauce” is our methodology for fine-tuning on very small labeled datasets.

In the next blog post, we will discuss how open-source NLP frameworks can accelerate the development of valuable natural text products.

The last blog post in this series will be about the mid-to-long-term perspectives for NLP AI. Specifically, we can expect to get truly intelligent conversational AIs that are context-aware, factually accurate, and not prone to picking up unwanted biases in data.
