TroublingFox125

Using spaCy to structure unstructured texts and display the results in a one page dashboard

Understanding texts and extracting useful information from them is often time-consuming. That’s where spaCy comes in, together with React-Plotly. These two libraries can save you a lot of time when getting the right data out of unstructured texts.

What we’re trying to achieve in this part is to process a text (or file) into a new file that gives us a better look at what data is inside this file, without having to spend countless hours analysing it ourselves.

Preprocessing
Because our input data is an Excel file, we can simply read it with pandas and turn it into a pandas DataFrame.
Pandas is a great library for filtering tables, which is perfect in this situation.

import pandas as pd

data = pd.read_excel(file)

For our preprocessing we want to drop all columns that are completely empty, replace the given acronyms with their full word(s), and finally remove all rows that still contain empty values. This way we have less missing data to worry about when analysing and processing the data later on.

import glob
import os

# data is the DataFrame that was read with pd.read_excel above
# dropping all empty column(s)
data.dropna(how="all", axis=1, inplace=True)

# replace acronyms with their full word
files = glob.glob(os.path.join("resources/Shiftbooks/Documenten", "*.xlsx"))
for excel in files:
    acronyms = pd.read_excel(excel, header=None)
    acronyms.columns = ["Acronym", "Meaning"]
    acronyms_dict = dict(zip(acronyms.Acronym, acronyms.Meaning))
    data = data.replace(acronyms_dict, regex=True)

# dropping all rows that contain empty values
data.dropna(inplace=True)

Note that the Excel file you use as an acronyms file should consist of exactly 2 columns: the first one being the acronym and the second one being the meaning of this acronym (the full word or words).
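
One thing to keep in mind: with regex=True, pandas treats every acronym as a regular expression and, as far as we know, also replaces it when it appears inside a longer word. The following is a minimal sketch of the same replacement in plain Python, assuming you only want to replace whole words. The expand_acronyms helper and the sample acronyms are hypothetical and not part of the project.

```python
import re

def expand_acronyms(text, acronyms):
    """Replace whole-word acronyms in text with their meaning."""
    if not acronyms:
        return text
    # Longest acronyms first, so a longer acronym wins over a shorter prefix.
    keys = sorted(acronyms, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" from acting as regex syntax.
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: acronyms[m.group(1)], text)

# Hypothetical acronyms, in the same Acronym -> Meaning shape as the Excel file.
sample_acronyms = {"WW": "warmtewisselaar", "TK": "tank"}
expanded = expand_acronyms("WW lekt bij TK 3, TKO niet geraakt", sample_acronyms)
print(expanded)
```

In this sketch "TKO" is left alone because "TK" is only replaced when it stands on its own.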

Processing
We will be using Natural Language Processing (NLP) to further analyse the DataFrame. spaCy is a great library for this because it’s easy to set up, and it comes with some premade NLP pipes that can be used to process the data further.

import nl_core_news_md as dutch

nlp = dutch.load()
print(nlp.pipe_names)
# ["tok2vec", "morphologizer", "tagger", "parser", "ner", "attribute_ruler", "lemmatizer"]

Because we’re going to analyse Dutch texts, we went for the nl_core_news_md option. If you plan on analysing texts in a different language, please check spacy.io/models to find the correct import.

The spaCy model comes with a Named Entity Recognizer, yet this NER only has a few categories and thus isn’t optimal. Because of that we’ll be adding more categories to the NLP pipeline so that almost every sentence has at least one word that is categorized.

from spacy.language import Language
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler
import NLP.acronymPatterns as acronyms

token_matcher = Matcher(nlp.vocab)
for pattern in acronyms.getPatterns():
    token_matcher.add(pattern, [*acronyms.getPatterns().get(pattern)])

@Language.factory("custom_ner")
def create_custom_ner_component(nlp, name):
    custom_ner = EntityRuler(nlp, overwrite_ents=True)  # overwrite default entities with the new ones.
    # phrase_matcher is built elsewhere in the project and is not shown in this snippet
    custom_ner.phrase_matcher = phrase_matcher
    custom_ner.matcher = token_matcher
    return custom_ner

nlp.add_pipe("custom_ner", last=True)

# snippet of the acronyms.getPatterns() method:
def getPatterns():
    patterns = dict()
    patterns["MOTOR"] = [[{"TEXT": {"REGEX": "m[0-9]+"}}]]
    patterns["POMP"] = [[{"TEXT": {"REGEX": "pa[0-9]+"}}]]
    return patterns

The NLP pipeline will now also recognize words in sentences that match the MOTOR or POMP regular expression. To add more categories, simply add them to the getPatterns method as shown in the snippet.
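
To get a feel for what these two expressions catch, you can try them on a few tokens in plain Python. This is only a sketch: the sample tokens are made up, and it assumes that spaCy's REGEX operator searches anywhere inside the token text and that TEXT is case-sensitive. If that holds, anchoring the expression or matching on the lowercased token may be worth considering.

```python
import re

# The two expressions from the getPatterns() snippet.
patterns = {"MOTOR": r"m[0-9]+", "POMP": r"pa[0-9]+"}

def categorize(token_text, anchored=False):
    """Return the labels whose expression matches the token text."""
    labels = []
    for label, expression in patterns.items():
        if anchored:
            # fullmatch: the whole token has to look like the pattern
            hit = re.fullmatch(expression, token_text)
        else:
            # search: a match anywhere inside the token is enough
            hit = re.search(expression, token_text)
        if hit:
            labels.append(label)
    return labels

for token in ["m12", "pa7", "M12", "ram2000"]:
    print(token, categorize(token), categorize(token, anchored=True))
```

With a search, "ram2000" counts as a MOTOR because it contains "m2000", while the anchored variant rejects it. The uppercase "M12" is missed by both.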

A second big part of the analysis is finding out whether a sentence is positive, neutral, or negative. This (sentiment analysis) can be helpful to quickly find malfunctions based on texts. Of course, the sentiment analysis is only as good as the sentiment data it has, and finding enough sentiment data can be time-consuming. Luckily the library Pattern (or PatternLite, a lightweight version of Pattern) supports basic sentiment analysis for a few languages.

from pattern.text.nl import sentiment
# pattern.text.en for an English sentiment file.

def sentimentToWord(s):
    if s >= 0.1:
        return "POSITIVE"
    elif s >= -0.1:
        return "NEUTRAL"
    return "NEGATIVE"

@nlp.component("sentiment")
def sentimentAnalyse(doc):
    text = doc.text
    s = sentiment(text)[0]
    if not doc.has_extension("sentiment"):
        doc.set_extension("sentiment", default="NEUTRAL")
    doc._.sentiment = sentimentToWord(s)
    return doc

nlp.add_pipe("sentiment", last=True)

Unfortunately, sentiment analysis works best if the texts you’re analysing are written in perfect Dutch (or whatever language you are analysing). This means that if the texts contain a lot of typos or abbreviations, the sentiment analysis will often return Neutral. You could also write your own sentiment file, as Pattern’s sentiment data only consists of basic words and thus not the professional jargon that could be in your texts.
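
As a rough illustration of what such a custom sentiment file could add, here is a naive sketch in plain Python. It is not how Pattern works internally: the jargon words and their polarity values are made up, and the polarity helper simply averages the scores of the words it knows. The thresholds are the same ones used in sentimentToWord.

```python
# Hypothetical jargon lexicon: word -> polarity between -1.0 and 1.0.
jargon_polarity = {"lek": -0.6, "storing": -0.8, "hersteld": 0.5, "stabiel": 0.4}

def polarity(text, lexicon):
    """Average the polarity of the known words; 0.0 when none are known."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def sentiment_to_word(s):
    # Same thresholds as sentimentToWord in the article.
    if s >= 0.1:
        return "POSITIVE"
    elif s >= -0.1:
        return "NEUTRAL"
    return "NEGATIVE"

for sentence in ["druk stabiel", "lek bij tank", "pomp gestart"]:
    print(sentence, "->", sentiment_to_word(polarity(sentence, jargon_polarity)))
```

A sentence without any known word, such as "pomp gestart", falls back to NEUTRAL, which mirrors the behaviour described above for texts full of typos or abbreviations.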

One last part of our analysis will be dividing all of our texts into topics, where a topic is a group of keywords that are often used together and/or are similar. To realise this we use a gensim LDA (Latent Dirichlet Allocation) model.
First we’ll make a list of sentences out of our data, after which we’ll use this list of sentences to fill the word Dictionary and finally create a bag of words (which can be seen as a list with every word and the number of occurrences of this word in the text). After this we simply give this bag and dictionary to the LDA model and it’ll be ready to assign each of our entries to a topic.

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from spacy.tokens import Doc

def sent_to_words(sent):
    for sentence in sent:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def remove_shorts(text):
    t = []
    for word in text:
        if len(word) > 3:
            t.append(word)
    return t

class TopicModel:
    def __init__(self, doc, topics):  # doc: the text column of your DataFrame, topics: the amount of topics you want
        doc = [sentence for sentence in nlp.pipe(doc)]
        Doc.set_extension("topic", default=0)
        texts = list(sent_to_words(doc))
        self.dictionary = Dictionary(texts)
        bag = [self.dictionary.doc2bow(remove_shorts(text)) for text in texts]
        self.lda_model = LdaModel(corpus=bag, num_topics=topics, id2word=self.dictionary, per_word_topics=True)

    def __call__(self, doc) -> Doc:
        cleaned_doc = [doc.text]
        texts = list(sent_to_words(cleaned_doc))
        bag = [self.dictionary.doc2bow(remove_shorts(text)) for text in texts]
        topics = self.lda_model.get_document_topics(bag)[0]
        best_topic = max(topics, key=lambda y: y[1])[0]
        doc._.topic = best_topic
        return doc

After doing all this, every sentence from your file will have been given a topic. Just like the sentiment analysis, topics can be used to find malfunctions and/or alarming sentences.
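
To make the dictionary and bag-of-words steps more concrete, here is a small sketch in plain Python that mimics the idea without gensim. The build_dictionary and to_bag helpers and the sample data are hypothetical, and gensim's own Dictionary may assign word ids in a different order. The last lines show how the best topic is picked from a list of (topic id, probability) pairs, which is what the max(...) call in TopicModel does.

```python
from collections import Counter

def build_dictionary(texts):
    """Map every distinct word to an integer id, in order of first appearance."""
    word_to_id = {}
    for words in texts:
        for word in words:
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id

def to_bag(words, word_to_id):
    """Return sorted (word id, count) pairs for the known words of one text."""
    counts = Counter(word_to_id[w] for w in words if w in word_to_id)
    return sorted(counts.items())

# Two made-up, already tokenized texts.
texts = [["pomp", "lekt", "pomp"], ["motor", "gestart"]]
word_to_id = build_dictionary(texts)
bags = [to_bag(words, word_to_id) for words in texts]
print(bags)

# Hypothetical model output for one text: (topic id, probability) pairs.
topic_scores = [(0, 0.15), (1, 0.7), (2, 0.15)]
best_topic = max(topic_scores, key=lambda pair: pair[1])[0]
print(best_topic)
```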

After all this processing it’s time to export the DataFrame again so that we can start visualising the data. To do so you first have to add a few new columns to the DataFrame with new data such as the Sentiment values, Topics, and the Word Categories. Because this is simply running your text data through the NLP pipeline and adding the results to the DataFrame, we did not include this in the article.
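
For readers who want a starting point anyway, the shape of that step could look roughly like the sketch below. It deliberately avoids pandas and spaCy so that it runs on its own: rows are plain dictionaries, the result is written as CSV instead of Excel, and analyse is a stand-in for the real pipeline. The "Tekst" and "Topic" column names are assumptions.

```python
import csv
import io

def analyse(text):
    """Stand-in for the NLP pipeline; returns the values for the new columns."""
    # In the real project these values would come from doc.ents,
    # doc._.sentiment and doc._.topic after calling nlp(text).
    return {"Sentiment": "NEUTRAL", "Topic": 0, "WoordCategorieën": ""}

def add_nlp_columns(rows, text_column):
    """Return new rows with the analysis results added to every row."""
    return [{**row, **analyse(row[text_column])} for row in rows]

rows = [{"Bedrijfsplaats": "Polyether", "Tekst": "m12 gestart"}]
enriched = add_nlp_columns(rows, "Tekst")

# Write the result as CSV text; a file opened with open(..., newline="")
# can be passed to DictWriter in the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(enriched[0].keys()))
writer.writeheader()
writer.writerows(enriched)
print(buffer.getvalue())
```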

At the start of the project, we had to decide on which front-end framework to use. We decided on React, with its relevancy and the lack of complexity of the webpage as main deciding factors. React projects are essentially one-pagers, and because all we needed was authentication, an upload button, and a dashboard, it seemed like a good choice over for example Angular.

We implemented our authentication with Auth0 and we can’t recommend it enough. They have implementations for almost every programming language under the sun and it works great almost straight out of the box. We made some tweaks in the Auth0 configuration portal to allow us to use the Active Directory of the company we are making this project for, and with that the authentication was done. We can now log in and out of our web app.

Uploading the file

We don’t need to spend too much time on the upload page. We just want a simple web page with an upload button which is restricted to exclusively Excel files.

We also added a second button to use a pre-uploaded file. This shortens the processing time, which is helpful for demo purposes.

The styling of the application wasn’t our main focus, so we decided on react-bootstrap, which makes it easier to deliver good looking, uniform components without wasting too much time on them. If you plan on using react-bootstrap in your react project, make sure you don’t mix up react-bootstrap with reactstrap, two libraries with the same goal but different ways of achieving it. We tried to find out the major differences between them, but concluded that it’s up to preference for the most part, so we stuck with react-bootstrap.

For the file upload/transfer to the back end we used Axios as an alternative to fetch. We had never used Axios before, and since the focus of this project was to learn new technologies, it seemed like a good moment to try out what the internet considers the best alternative to fetch. The difference between Axios and fetch isn’t too big, and it was quite easy to get where we wanted to be.

An example:

axios.get(backend + "/filter", {
    params: {
        id: result,
        filter: props.props,
    },
})
    .then((res) => {
        // ... code to put chart data in charts
    })
    // log failed requests instead of silently ignoring them
    .catch((err) => console.error(err));

Filtering the displayed data all in one go required an extra filter component. Initially the displayed data has a default filter, but when you press the Filter button it will make a new API call to the Python service with the given filter and return only the data that you requested.

Filtering can be useful, for example, when looking for malfunctions. Simply select sentiment “Negative” and you’ll be given every single entry that could be the cause of a malfunction. Of course we can filter even more and also select a Category, filtering out all the entries that do not match the given Category.

An example of the filtering

axios.get(backend + "/filter", {
    params: {
        id: result,
        filter: {
            Bedrijfsplaats: "Polyether",
            WoordCategorieën: "PRODUCT",
            Sentiment: "POSITIVE"
        },
    },
})
If we could redo the project, we would probably start the front-end as a Dash project. For our purposes, Dash would probably have given us the smoothest trajectory, as it is a library built on top of Plotly that is meant to make graphing easier. Sadly, we realized this too late, and didn’t have the time to discard our whole front-end at that point and start from scratch.

Working like this has its perks though, as we learned React and a couple of its libraries, which was definitely worth it. React was harder than expected, and the React developers moving away from class-based components in favor of hooks makes finding good, up-to-date documentation harder than it should be. This in turn made understanding hooks correctly harder than anticipated. The same goes for Plotly: while the plotly.js documentation is splendid, the react-plotly documentation is lackluster at best. Still, this project formed a good basis for learning React in more depth in the future.

The backend with Python did however feel like the right option, as we wanted to analyse the data without spending too much time converting the data into a type that we could process with ease. And because Python has amazing libraries for this we’re more than happy with our choice.

This project was a great learning opportunity, and we are thankful that we got the opportunity to learn all these new technologies in a meaningful way by making an application that could be used by a real company with a real purpose.