Question answering systems are being heavily researched at the moment thanks to huge advancements in the field of Natural Language Processing. Key players in the industry have developed incredibly advanced models, some of which already perform at human level. One of these is BERT (Bidirectional Encoder Representations from Transformers), developed by researchers at Google.
In this article we’re going to use DistilBERT (a smaller, lightweight version of BERT) to build a small question answering system. This system will process text from Wikipedia pages and answer some questions for us. We are then going to put our model to the test with some questions and analyze the results.
Article overview
- Approach for building a question answering system
- Project setup
- Building a Wikipedia text extractor
- Question processing
- Searching for relevant context – BM25
- Using DistilBERT for question answering
- Building the question answering logic
- Testing our system
- Related articles
- Conclusions
Approach for building a question answering system
Our question answering system will work in 4 stages:
- Extract text from Wikipedia: We will download text from a few Wikipedia articles in order to build our dataset. I will cache the text in my local environment because there is no need to download the same text again every time I make changes to the system.
- Process the question: Here I’m going to extract the most important bits of the input question, because using every word in the question would lower the accuracy of the results.
- Retrieve context from the text: Given an input question, we will try to find the most relevant sentences in the entire data corpus. This keeps the search space small for our answer retriever model and leads to higher accuracy.
- Retrieve answer from the context: This is where our BERT model will come into action. We will feed the context from the earlier step to our model and will get our answer in return.
What I’m trying to build here is what I think powers the instant answers that search engines sometimes offer for certain queries. If you Google “what is the capital city of Romania?” you will first get an answer box with “Bucharest”, and the results from other pages around the internet come below this box.
My intuition tells me that the search engine first looks at your query and finds the most relevant pages related to your question, then looks at those pages and tries to extract a direct answer for you. This is what I also tried to do for this project.
Project setup
As I wrote at the beginning of this article, a lot of research is going on in this field and the community can only benefit from it. Many tools have been built on top of the latest research results, and awesome tools like these are exactly what makes this project not only possible, but also very easy and quick 😊.
First let’s install spaCy, a library which I really like and which I’ve been using in many projects, such as building a knowledge graph or analyzing semantic relationships. I’m also going to download the small version of the spaCy language model for English. Larger models are available but the small version is just enough for this project.
pip install spacy
python -m spacy download en_core_web_sm
It’s time now to install wikipedia, an awesome package for extracting text from Wikipedia pages.
pip install wikipedia
Next up is Gensim, another package which I really enjoy using, especially for its really good Word2Vec implementation.
pip install gensim
For the last 2 dependencies, I’ll install PyTorch and transformers from HuggingFace 🤗. It’s my first time using these 2 packages, but I think they are really powerful and really easy and fun to work with.
pip install torch
pip install transformers
Now, with all our dependencies in place, it’s time to start building our question answering system.
Building a Wikipedia text extractor
If you’ve been reading other articles on this blog you might already be familiar with my approach for extracting articles from Wikipedia pages. I know it’s not the best or most efficient way of extracting the text, but it’s quick and easy and lets you build a small play dataset for a project.
First let’s write a small class to extract the text from one Wikipedia page. Let’s create a text_extractor.py file and put it in our project directory.
import wikipedia
import os

class TextExtractor:
    __pageTitle: str
    __pageId: str

    def __init__(self, pageTitle, pageId):
        self.__pageTitle = pageTitle
        self.__pageId = pageId

    def extract(self):
        # Make sure the cache directory exists, then download the
        # page content only if it hasn't been cached locally yet
        os.makedirs("./text", exist_ok=True)
        fileName = "./text/" + self.__pageTitle + ".txt"
        if not os.path.isfile(fileName):
            page = wikipedia.page(title=self.__pageTitle, pageid=self.__pageId)
            f = open(fileName, "w")
            f.write(page.content)
            f.close()

    def getText(self):
        # Read the cached text back from disk
        f = open("./text/" + self.__pageTitle + ".txt", "r")
        return f.read()
The approach is very simple here. The constructor takes 2 params, a page title and a page id. The reason for also requiring a page id is that I noticed the wikipedia package sometimes gets confused by certain titles, so I prefer to also use this param. To extract the page id for a Wikipedia article, go to Wikidata and search for your article there. The page id is the one in the brackets right after the title of your result.
As I said earlier, I’m storing the text in a local directory (/text) so that downloading the text is not necessary for every run of the project.
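Here is a quick usage sketch (the page id Q84 for the London article comes from Wikidata, as described above):

from text_extractor import TextExtractor

# Q84 is the id of the London article, taken from Wikidata
textExtractor = TextExtractor("London", "Q84")
textExtractor.extract()               # downloads and caches ./text/London.txt on the first run
print(textExtractor.getText()[:100])  # print the first 100 characters of the cached text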
The second class needed for this step is a text extractor pipe. This allows us to collect multiple TextExtractor instances and combine the text from all of them into one big chunk. This is the content of the text_extractor_pipe.py file.
from text_extractor import TextExtractor

class TextExtractorPipe:
    __textExtractors: [TextExtractor]

    def __init__(self):
        self.__textExtractors = []

    def addTextExtractor(self, textExtractor: TextExtractor):
        self.__textExtractors.append(textExtractor)

    def extract(self) -> str:
        # Concatenate the text from all registered extractors into one corpus
        result = ''
        for textExtractor in self.__textExtractors:
            result = result + textExtractor.getText()
        return result
Question processing
It’s time for the first real NLP step of this project. I’m going to do a little bit of question processing here. By that I mean I’m going to remove stop words from the original question text and keep only the essential parts. For example:
Original question: “What is the capital city of Romania?”
Processed question: “capital city Romania”
Why am I doing this? The question contains words that are not essential for the search. This matters especially for this step, because we need to extract only the sentences that are closest to our original question, and words like “what”, “is”, and especially “the” appear in too many places in our dataset, which can lower the accuracy of our search.
You might argue that the other words are important too: once I find mentions of the capital city of Romania in the dataset, I still need to know what the question asks so I know what to extract. And you’re right, don’t worry, we’ll keep the original question because we are going to reuse it later. But for searching purposes, the processed question is enough.
I’m going to use spaCy to process the question. The logic here is very simple, I’m going to apply spaCy’s NLP model to the question text in order to tokenize it and identify the parts of speech of all the words in the question. Then I’m going to keep only the parts of speech I’m interested in: nouns, proper nouns, and adjectives.
Here are the contents of question_processor.py.
class QuestionProcessor:
    def __init__(self, nlp):
        # The parts of speech we keep: nouns, proper nouns and adjectives
        self.pos = ["NOUN", "PROPN", "ADJ"]
        self.nlp = nlp

    def process(self, text):
        # Tokenize the question and keep only the interesting parts of speech
        tokens = self.nlp(text)
        return ' '.join(token.text for token in tokens if token.pos_ in self.pos)
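As a quick check (assuming the small English spaCy model from the setup step), the processor reduces our earlier example question to its essential words:

import spacy
from question_processor import QuestionProcessor

nlp = spacy.load('en_core_web_sm')
questionProcessor = QuestionProcessor(nlp)
print(questionProcessor.process("What is the capital city of Romania?"))
# capital city Romania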
Searching for relevant context – BM25
Here starts the actual search for the context in which the answer to our question will probably be found. But first, we need to mention what BM25 is.
BM25 is a function, or an algorithm, used to rank a list of documents based on a given query. That’s why it is also called a ranking function. It is very similar to TF-IDF and it is so good that, as I understand, Elasticsearch uses it for document ranking. I’m not going to go into the maths behind BM25 because it is a little too complicated for the purpose of this project, but the most relevant aspects here are:
- It is a bag-of-words model, and that means the algorithm disregards grammar structure but takes into account term frequencies – making it just ideal for our project.
- It takes a query and helps us sort a collection of documents based on how relevant they are for that query.
- The Gensim package has a very good BM25 implementation that is very easy to use (see the quick sketch after this list).
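Here is a minimal sketch of Gensim’s BM25 implementation on a toy corpus of hand-tokenized documents (the documents and the query are made up just for illustration):

from gensim.summarization.bm25 import BM25

# A toy corpus: each document is a list of tokens
corpus = [
    ["bucharest", "be", "the", "capital", "of", "romania"],
    ["london", "be", "the", "capital", "of", "england"],
    ["berlin", "have", "many", "museum"],
]
bm25 = BM25(corpus)
# Score every document against the query; a higher score means more relevant
print(bm25.get_scores(["capital", "romania"]))
# The first document should get the highest score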
I see only good news in that list, so let’s get to work 😃. Here’s the approach I’m going to use:
- Get a list of all sentences in our dataset and the processed question.
- Tokenize all our sentences and use the lemmas of the words instead of the original words. The lemma of a given word is its base form (for example, we’re transforming “running” to “run”) and we use it to improve the accuracy of our search (see the short example after this list). We’re also doing this for the question text. If you want to know more about lemmatization and stemming you can read this article.
- Use the BM25 ranking function to rank all our documents against the given query.
- Extract the top N results from the step above and build a paragraph out of all those N sentences.
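As a quick illustration of spaCy’s lemmatization (a tiny sketch, assuming the small English model is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
# Every token exposes its base form through token.lemma_
print([token.lemma_ for token in nlp("The dogs were running")])
# e.g. ['the', 'dog', 'be', 'run']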
Here is the content of context_retriever.py.
from gensim.summarization.bm25 import BM25

class ContextRetriever:
    def __init__(self, nlp, numberOfResults):
        self.nlp = nlp
        self.numberOfResults = numberOfResults

    def tokenize(self, sentence):
        # Lemmatize every token so sentences match the processed question better
        return [token.lemma_ for token in self.nlp(sentence)]

    def getContext(self, sentences, question):
        # Tokenize and lemmatize every sentence in the corpus
        documents = []
        for sent in sentences:
            documents.append(self.tokenize(sent))
        bm25 = BM25(documents)
        # Score every sentence against the processed question
        scores = bm25.get_scores(self.tokenize(question))
        # Sort the sentence indices by score, highest first
        results = {}
        for index, score in enumerate(scores):
            results[index] = score
        sorted_results = {k: v for k, v in sorted(results.items(), key=lambda item: item[1], reverse=True)}
        results_list = list(sorted_results.keys())
        # Keep the top N sentences and join them into one context paragraph
        final_results = results_list if len(results_list) < self.numberOfResults else results_list[:self.numberOfResults]
        questionContext = ""
        for final_result in final_results:
            questionContext = questionContext + " " + " ".join(documents[final_result])
        return questionContext.strip()
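And a quick usage sketch with a few made-up sentences (note that the returned context consists of lemmatized tokens, since that is what we indexed):

import spacy
from context_retriever import ContextRetriever

nlp = spacy.load('en_core_web_sm')
contextRetriever = ContextRetriever(nlp, 2)
sentences = [
    "Bucharest is the capital of Romania.",
    "London is the capital of England.",
    "Berlin has many museums.",
]
# Rank the sentences against the processed question and keep the top 2
print(contextRetriever.getContext(sentences, "capital Romania"))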
Using DistilBERT for question answering
DistilBERT is a smaller, lighter and faster version of Google’s BERT model, developed by HuggingFace. It runs faster than the original model because it has far fewer parameters, yet it keeps most of the original model’s performance.
The logic is pretty simple here:
- Load the pretrained models for tokenization and for question answering from the transformers library.
- Tokenize the question and the question context.
- Use the question answering model to find the tokens for the answer.
- Convert answer tokens back to string and return the result.
I’ve added this logic to answer_retriever.py.
import torch
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering

class AnswerRetriever:
    def getAnswer(self, question, questionContext):
        # Load the pretrained tokenizer and the SQuAD-finetuned DistilBERT model
        distilBertTokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True)
        distilBertForQuestionAnswering = DistilBertForQuestionAnswering.from_pretrained(
            'distilbert-base-uncased-distilled-squad')
        # Encode the question and its context together as one input sequence
        encodings = distilBertTokenizer.encode_plus(question, questionContext)
        inputIds, attentionMask = encodings["input_ids"], encodings["attention_mask"]
        # The model returns a start score and an end score for every input token
        scoresStart, scoresEnd = distilBertForQuestionAnswering(torch.tensor([inputIds]),
                                                                attention_mask=torch.tensor([attentionMask]))
        # The answer spans from the highest-scoring start token to the highest-scoring end token
        tokens = inputIds[torch.argmax(scoresStart): torch.argmax(scoresEnd) + 1]
        answerTokens = distilBertTokenizer.convert_ids_to_tokens(tokens, skip_special_tokens=True)
        return distilBertTokenizer.convert_tokens_to_string(answerTokens)
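A minimal usage sketch, with a hand-written lemmatized context just to illustrate the call (the expected output is an assumption on my part):

from answer_retriever import AnswerRetriever

answerRetriever = AnswerRetriever()
questionContext = "bucharest be the capital and the large city of romania"
print(answerRetriever.getAnswer("What is the capital city of Romania?", questionContext))
# expected: bucharest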
Building the question answering logic
It’s time to write our entire question answering logic in our main.py file.
- I’ll first use the TextExtractor and TextExtractorPipe classes to fetch the text and build the dataset.
- Then I’m going to load the spaCy NLP model and use it to split the text into sentences. I’ll pass the same NLP model to the QuestionProcessor and ContextRetriever instances as described above.
- I’m going to store the original question text in a variable and feed that to the question processor.
- From there, I’ll pass the sentences list and the processed question to the ContextRetriever instance.
- Lastly, the original question and the context will be passed to an AnswerRetriever instance in order to get the final result.
import spacy
from question_processor import QuestionProcessor
from text_extractor import TextExtractor
from text_extractor_pipe import TextExtractorPipe
from context_retriever import ContextRetriever
from answer_retriever import AnswerRetriever

# Download (or load from the local cache) the three Wikipedia pages
textExtractor1 = TextExtractor("London", "Q84")
textExtractor1.extract()
textExtractor2 = TextExtractor("Berlin", "Q64")
textExtractor2.extract()
textExtractor3 = TextExtractor("Bucharest", "Q19660")
textExtractor3.extract()

# Combine the three texts into one corpus
textExtractorPipe = TextExtractorPipe()
textExtractorPipe.addTextExtractor(textExtractor1)
textExtractorPipe.addTextExtractor(textExtractor2)
textExtractorPipe.addTextExtractor(textExtractor3)

# Load the spaCy model and split the corpus into sentences
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(textExtractorPipe.extract())
sentences = [sent.string.strip() for sent in doc.sents]

questionProcessor = QuestionProcessor(nlp)
contextRetriever = ContextRetriever(nlp, 10)
answerRetriever = AnswerRetriever()

# Process the question, retrieve the context and extract the answer
originalQuestion = "What is the capital city of Romania?"
questionContext = contextRetriever.getContext(sentences, questionProcessor.process(originalQuestion))
print(originalQuestion)
print(questionProcessor.process(originalQuestion))
answer = answerRetriever.getAnswer(originalQuestion, questionContext)
print(answer)
Testing our system
Ok, it’s time to test my system and see what I’ve accomplished. I’m going to ask some test questions and see if the model can answer them. For this test I’ve downloaded the content of the London, Berlin and Bucharest Wikipedia pages.
For every question, I’ll display the original question, the processed question and the answer from our newly built question answering system.
[Screenshot of the test results: each entry shows the original question, the processed question and the system’s answer.]
Amazing! 😊 The system is able to answer all those questions (and many more) very well! Two notes I want to make here:
- Please note that all answers are lowercase because I’ve loaded the uncased DistilBERT model, but that’s still okay.
- I got really lucky on some answers (for example the one with UiPath). My logic failed to properly process the question but luckily there weren’t many mentions of the company in my small dataset.
But all in all I’m impressed by how the model managed to perform on these questions.
There are of course questions which the system could not answer correctly. One that really surprised me was “What’s the capital of Romania?”. Notice that in my example above I asked “What is the capital city of Romania?” and that worked correctly, but if I remove the word “city”, the model is not capable of finding the answer.
I’m sure it would manage on a bigger, better dataset, but it still surprised me. As I said though, I’m really happy with the results of this project.
Related articles
Throughout the article I make references to other articles on this blog. I’ll also add them here for ease of reference, in case you want to check them out.
- What Is Natural Language Processing? A Gentle Introduction to NLP
- Python NLP Tutorial: Building A Knowledge Graph using Python and SpaCy
- Python Knowledge Graph: Understanding Semantic Relationships
- Explained: Word2Vec Word Embeddings – Gensim Implementation Tutorial And Visualization
- TF-IDF Explained And Python Sklearn Implementation
- Lemmatization And Stemming In NLP – A Complete Practical Guide
Conclusions
In this article we’ve built a question answering model using a distilled version of BERT. We’ve played with it for a little bit and saw some examples where it worked beautifully, but also examples where it failed to meet expectations. All in all, it was a really fun project to build and I hope you have enjoyed it too!