Raffle makes AI tools to realize our vision of giving employees and end-users seamless access to company information. We use natural language processing (NLP) machine learning so that users can search with natural text, the same way they would pose a question to another person.
Machine learning needs training data to work, and the more, the better. Creating that data, for example by labeling historical queries to associate them with the correct answers, is time-consuming and therefore expensive, and it delays the point at which your AI solution performs well enough to deploy.
So it sounds like natural-text search is out of reach for companies that don't have the resources to produce sufficient training data. But this is actually no longer the case.
Pre-train to make gains
NLP has made a lot of progress in recent years because of what we call pre-training. This has been a real game-changer for achieving good performance with a smaller investment. To explain pre-training, we need to be a bit more specific about what we mean by training data when we’re discussing NLP:
- Unlabeled data. This can be text we collect from the internet or text available inside companies. Practically unlimited unlabeled data exists, but we have to be careful about what we use, because our model will learn from it.
- Labeled data. This is the expensive data. At raffle.ai, our supervised data consists of question-answer pairs, so we need a number of questions for each answer. The questions can come from historical query logs, be collected live and labeled in Raffle Insights, or even be constructed by our in-house AI trainers to start the model off at a reasonable level of performance.
How to train your language model
Once trained, a language model can “understand” the meaning of sentences. Or, more precisely, if we take two sentences with the same meaning, their representations will be similar.
This is a very good foundation for building other NLP applications, such as a question-answering system, because we now have a way to represent our questions in a way that robustly reflects how we ask them.
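To make "similar meaning, similar representation" concrete, here is a minimal sketch that compares sentence vectors with cosine similarity. The three-dimensional embeddings are made up for illustration; a real language model would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical sentence embeddings (hand-made for this sketch).
embeddings = {
    "How do I reset my password?":   [0.90, 0.10, 0.20],
    "I forgot my login credentials": [0.85, 0.15, 0.25],
    "What are your opening hours?":  [0.10, 0.90, 0.30],
}

same_meaning = cosine_similarity(
    embeddings["How do I reset my password?"],
    embeddings["I forgot my login credentials"])
different_meaning = cosine_similarity(
    embeddings["How do I reset my password?"],
    embeddings["What are your opening hours?"])

print(same_meaning > different_meaning)  # True: similar meaning, similar vectors
```

A question-answering system can then match an incoming question against known questions purely by comparing vectors, without requiring an exact keyword match.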
So the NLP application recipe à la 2020 is to:
- Pre-train a language model with unlabeled data or — even better — get someone else to supply one for us
- Fine-tune on a small labeled dataset
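The two-step recipe can be sketched end-to-end in miniature. In this toy version, "pre-training" just builds word co-occurrence counts from unlabeled text, and "fine-tuning" attaches a tiny labeled set of question-answer pairs on top; the sentences, questions, and answer IDs are all made up, and a real system would learn dense neural representations instead of counting.

```python
from collections import Counter

# Step 1: "pre-train" on unlabeled text. Here a word's representation is
# simply the bag of words it co-occurs with.
unlabeled = [
    "reset your password from the login page",
    "forgot password click reset",
    "our office opening hours are nine to five",
    "we open at nine and close at five",
]

cooccurrence = {}
for sentence in unlabeled:
    words = sentence.split()
    for w in words:
        cooccurrence.setdefault(w, Counter()).update(x for x in words if x != w)

def represent(text):
    """Represent a text as the sum of its words' co-occurrence counts."""
    rep = Counter()
    for w in text.split():
        rep.update(cooccurrence.get(w, Counter()))
        rep[w] += 1
    return rep

def similarity(a, b):
    return sum(a[w] * b[w] for w in a)

# Step 2: "fine-tune" on a small labeled set of question -> answer pairs.
labeled = {
    "how do i reset my password": "answer_password_reset",
    "when are you open": "answer_opening_hours",
}

def answer(question):
    q = represent(question)
    return labeled[max(labeled, key=lambda known: similarity(q, represent(known)))]

print(answer("i forgot my password"))  # answer_password_reset
```

The point of the sketch is the division of labor: the unlabeled corpus teaches the model that "forgot" and "reset" belong with "password", so only a handful of labeled pairs are needed to route new phrasings to the right answer.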
But how do we leverage large unlabeled datasets to get representations that learn the meaning of sentences? The key here is the context: a single word in a sentence gets part of its meaning from the surrounding text.
So if we train a model to predict a word given a context, such as the preceding words ("Josef walks his ___") or the surrounding words ("the cat ___ the mouse"), then the model is forced to learn a representation which is context-aware.
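The fill-in-the-blank idea can be illustrated with a toy predictor that counts, over a tiny made-up corpus, which word appears between a given pair of neighbors. A real language model replaces the counting with a neural network trained on billions of sentences.

```python
from collections import Counter, defaultdict

# Toy corpus; a real model trains on vastly more text.
corpus = [
    "the cat chased the mouse",
    "the cat caught the mouse",
    "the dog chased the cat",
    "josef walks his dog",
    "maria walks her cat",
]

# Count which word fills each (previous word, next word) context.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, word, nxt in zip(words, words[1:], words[2:]):
        context_counts[(prev, nxt)][word] += 1

def predict(prev, nxt):
    """Most frequent word seen between `prev` and `nxt` in the corpus."""
    candidates = context_counts[(prev, nxt)]
    return candidates.most_common(1)[0][0] if candidates else None

print(predict("walks", "dog"))  # his
print(predict("cat", "the"))    # "chased" (tied with "caught"; first seen wins)
```

Even this crude model has absorbed some context: it knows that what a cat does to "the mouse" is different from what follows "walks".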
BERT and beyond
There are many language models on the market. An early, famous one is word2vec. A fascinating property of its representations is that you can do approximate arithmetic with them, such as: "king" - "man" + "woman" ≈ "queen".
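The arithmetic can be demonstrated with hand-crafted two-dimensional vectors, where one axis stands in for gender and the other for royalty. Real word2vec vectors have hundreds of dimensions learned from raw text, and the analogy only holds approximately there; in this toy space it works exactly.

```python
# Hand-crafted 2-d "embeddings": axis 0 encodes gender, axis 1 royalty.
vectors = {
    "man":   (1.0, 0.0),
    "woman": (-1.0, 0.0),
    "king":  (1.0, 1.0),
    "queen": (-1.0, 1.0),
}

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def nearest(v, exclude=()):
    """Word whose vector is closest (squared Euclidean distance) to v."""
    def dist(word):
        return sum((x - y) ** 2 for x, y in zip(vectors[word], v))
    return min((w for w in vectors if w not in exclude), key=dist)

# "king" - "man" + "woman" lands exactly on "queen" in this toy space.
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(result, exclude=("king",)))  # queen
```

Subtracting "man" removes the male direction, adding "woman" supplies the female one, and the royalty component carries over untouched.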
Today, the most popular one is BERT, which is short for Bidirectional Encoder Representations from Transformers. BERT is a masked language model, which means that the model's task is to predict one or more words that have been masked out of the input, as shown in the example below.
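Here is a minimal sketch of how masked-language-model training examples can be constructed. The roughly 15% masking rate and the `[MASK]` token follow BERT's convention, but the whitespace tokenization is a simplification; real BERT operates on subword pieces.

```python
import random

MASK = "[MASK]"

def make_masked_example(sentence, mask_rate=0.15, rng=None):
    """Return (masked_tokens, targets), where targets maps a masked
    position back to the original word the model must recover."""
    rng = rng or random.Random()
    tokens = sentence.split()
    # Mask ~15% of the tokens, and always at least one.
    n_masked = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_masked)
    targets = {i: tokens[i] for i in positions}
    masked = [MASK if i in targets else tok for i, tok in enumerate(tokens)]
    return masked, targets

masked, targets = make_masked_example(
    "the cat chased the mouse", rng=random.Random(0))
print(masked, targets)
```

During pre-training the model sees only the masked sequence and is scored on how well it predicts the hidden words, which is exactly the fill-in-the-blank objective described above.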
As is often the case in deep learning, more data and larger models help performance. The standard pre-trained BERT models are transformers with hundreds of millions of parameters (roughly 110 million for BERT-Base and 340 million for BERT-Large), trained on all of English Wikipedia and other sources.
It sounds gigantic, but it is actually possible to put it into production and run it without noticeable lag for the user. You can try it out with Raffle today.
In the next post in this series, we’ll look closer at how we fine-tune to solve question-answering tasks. We will also look into a larger trend of how search is changing.
Ready to use the power of NLP to empower your customers?
Talk to one of our product specialists today to find out more about our products.