This is the second in our “The science behind the Raffle-lution” series. Read the first here.
When we think about lightning-fast search capabilities today, one name comes to mind: Google. Using its search engine, you can search the entire internet at the click of a button.
How did they achieve this? Search engines represent the text of any document (or web page) with an index, which in its simplest form is a list of all the words that occur in it. A collection of such documents is represented with a document-term matrix.
This matrix is usually sparse (it contains a lot of zeros) because each document contains only a small subset of all possible words. Sparse matrices can be stored compactly and scanned efficiently, which makes lookup extremely fast.
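To make this concrete, here is a toy sketch in Python (not any search engine's actual implementation) of such a sparse index: each document is reduced to the words it contains, and an inverted index maps each word to the documents in which it occurs, which is exactly what makes lookup so fast.

```python
from collections import defaultdict

docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "new york is a big city",
    "doc3": "the city cat likes new toys",
}

# Inverted index: word -> set of document ids containing that word.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

def search(query: str) -> set:
    """Return the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = inverted_index.get(words[0], set()).copy()
    for word in words[1:]:
        result &= inverted_index.get(word, set())
    return result

print(search("city cat"))  # {'doc3'}
```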
Traditional search has limitations
However, a traditional search index is like putting a document's words in a bag and then shaking it. The word order, and with it the detailed meaning, is gone.
We can partially make up for that by storing in the bag not only single words but also terms made up of two or more words, such as “New York.”
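As a toy illustration, here is how a bag of terms containing both single words and two-word terms might be built (the helper function is ours, just for illustration). Notice how quickly the number of stored terms grows.

```python
def bag_of_terms(text: str, max_n: int = 2) -> set:
    """Collect all 1- to max_n-word terms occurring in the text."""
    words = text.lower().split()
    terms = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms

print(bag_of_terms("flights from new york"))
# Contents: 'flights', 'from', 'new', 'york',
#           'flights from', 'from new', 'new york'
```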
But this doesn’t fully solve the problem, because the number of distinct terms we have to store grows very quickly. Google’s search index stores an enormous number of terms; exactly how many is unknown, but Google Trends gives an impression of the scale.
From sparse to dense search indices
As discussed in our last blog post, Raffle’s machine learning solutions for natural text search are built upon context-aware text representations.
So, where the traditional search approach is fundamentally limited, the machine learning approach can, given sufficient training data, learn to pick up the subtle contextual differences that determine whether the correct answer is found.
Recently, Google made a major improvement to its search results by using BERT to understand natural text queries. We see this as part of an overall trend in search: moving from traditional sparse search indices to learned dense representations of documents and queries.
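As a rough illustration of what a dense search index looks like in code, here is a minimal sketch using an off-the-shelf sentence encoder. The sentence-transformers library and the model name are our choices for the example, not the models Google or Raffle actually use; the point is that the query and the documents are compared by inner product between dense vectors rather than by word overlap.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "You can reset your password from the account settings page.",
    "Our office in New York is open Monday to Friday.",
    "Refunds are processed within five business days.",
]
doc_vectors = model.encode(documents)        # one dense vector per document

query = "How do I change my password?"
query_vector = model.encode([query])[0]

# Inner-product similarity between the query and every document.
scores = np.dot(doc_vectors, query_vector)
best = int(np.argmax(scores))
print(documents[best])  # matches the password document despite different wording
```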
A new way to answer questions
Recent work from both Google and Facebook AI Research applies this approach to question answering at scale using fine-tuned BERT models. These systems follow a two-stage process:
- Document retrieval. The document retriever first encodes the whole knowledge base (for example, the entirety of Wikipedia!) into a couple of million dense representations. The question is encoded as well, and based on a similarity measure (the inner product), the top 5 to 10 excerpts of the knowledge base are passed on to the answer generation stage.
- Answer generation. The answer generator uses the excerpts from the knowledge base together with the question to produce an answer, either by extracting pieces of text from the excerpts or by using a generative language model that composes the answer (a schematic sketch of both stages follows this list).
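Here is a schematic sketch of that two-stage pipeline. The function names, the passages, and the reader are placeholders we introduce for illustration; in practice the passage vectors are precomputed for millions of knowledge-base excerpts and the second stage runs a fine-tuned model.

```python
import numpy as np

def retrieve(question_vector: np.ndarray,
             passage_vectors: np.ndarray,
             passages: list[str],
             top_k: int = 5) -> list[str]:
    """Stage 1: rank passages by inner product and keep the top_k excerpts."""
    scores = passage_vectors @ question_vector
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [passages[i] for i in top_indices]

def generate_answer(question: str, excerpts: list[str]) -> str:
    """Stage 2 (placeholder): extract or compose an answer from the excerpts."""
    # A real system would run a fine-tuned BERT-style reader or a generative
    # language model here; this stub simply returns the best-ranked excerpt.
    return excerpts[0]

# Usage (with a hypothetical encode() producing the dense vectors):
#   excerpts = retrieve(encode(question), passage_matrix, passages)
#   answer = generate_answer(question, excerpts)
```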
Raffle’s products use a similar document retrieval model. Each retrieved document is an answer in itself, so there is no need for an answer generation module. Our “secret sauce” is our methodology for fine-tuning on very small labeled datasets.
In the next blog post, we will discuss how open-source NLP frameworks can accelerate the development of valuable natural text products.
The last blog post in this series will be about the mid-to-long-term perspectives for NLP AI. Specifically, we can expect truly intelligent conversational AIs that are context-aware, factually accurate, and not prone to picking up unwanted biases from data.