Natural Language Processing (NLP) is a cross-disciplinary field covering linguistics, computer science, information engineering and artificial intelligence (AI). NLP allows us to program computers to help us digest vast amounts of text simply and effectively. Here in our reports from Inquire Europe’s 2019 autumn seminar in Krakow, Poland, where some 100 investment professionals and academics discussed advances in “Investing with Machine Learning and New Techniques”, we look at how it works.
A follow-up post* will examine practical applications of NLP, as discussed in presentations at the seminar.
The most common NLP techniques start by pre-processing the text. Each document is simplified by removing elements other than words, such as punctuation, and very common words such as "the," "a" or common verb forms like "to be". Rare words are also removed because, although they convey meaning, they significantly increase the computational cost.
A second step is stemming: words are reduced to their stem, for example replacing "economics" or "economically" with "economic". A better alternative is lemmatisation, which groups words by their lemma (the root form). This can tell the difference between "meeting" as a form of the verb "to meet" and the noun "meeting". It also associates "better" with its lemma, "good".
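As a rough sketch, the pre-processing steps above might look like this in Python. The stop-word list and suffix rules here are tiny, illustrative stand-ins: real pipelines use larger stop-word lists and a proper stemmer or lemmatiser such as the Porter algorithm or spaCy's lemmatiser.

```python
import re

# A tiny illustrative stop-word list; real pipelines use far larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "be", "of"}

# A few hypothetical suffix rules standing in for a real stemmer.
SUFFIXES = ("ically", "ics", "ally", "ing", "ed", "s")

def preprocess(document: str) -> list[str]:
    """Lower-case, strip punctuation, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", document.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in kept:
        for suffix in SUFFIXES:
            # Only strip a suffix if a reasonable stem remains.
            if t.endswith(suffix) and len(t) - len(suffix) >= 4:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The economics of investing are discussed."))
# → ['econom', 'invest', 'discuss']
```

Note how "economics" and, say, "economically" would both collapse to the same stem, which is exactly what makes downstream word counts comparable.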
The next step is to represent text as data for numerical analysis, transforming the documents into a document-term matrix. This describes how frequently words occur in each document, with rows corresponding to documents and columns to words. As there is a column for every word remaining in the corpus, the matrix is often very large and sparse. This simple representation is called bag-of-words.
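A minimal bag-of-words construction, on a toy two-document corpus of our own invention, can be written in a few lines:

```python
from collections import Counter

docs = [
    "rates rise as markets fall",
    "markets rally as rates fall",
]

# Vocabulary: one column per distinct word across the whole corpus.
vocab = sorted({w for d in docs for w in d.split()})

# Document-term matrix: one row per document, one count per vocabulary word.
dtm = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)  # column labels
print(dtm)    # word counts per document
```

With a real corpus of thousands of documents, most entries are zero, which is why sparse matrix formats are normally used in practice.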
However, it is also possible to use n-grams – sequences of n consecutive words. So "good morning" will be treated as a unit rather than separately as "good" and "morning". In practice, the number of words in each n-gram tends to be small because the number of possible n-grams, and hence the computational cost, grows rapidly as n increases.
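Extracting n-grams is a simple sliding window over the token sequence, as this sketch shows:

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """All contiguous n-word phrases, in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "good morning to all investors".split()
print(ngrams(tokens, 2))
# → ['good morning', 'morning to', 'to all', 'all investors']
```

With n = 1 this reduces to the plain bag-of-words tokens; each step up in n multiplies the number of possible columns in the document-term matrix.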
Pre-processing the text may also include PoS tagging, i.e. marking up each word with its part of speech (PoS), a category of words with similar grammatical properties: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection and others. Words with the same PoS play similar roles within the grammatical structure of sentences, which is why PoS tagging can be useful for understanding the meaning of a sentence or for extracting relationships.
Lexicon-based methods assign each word the PoS most frequently occurring with it in the text, whereas rule-based methods use rules such as "words ending in 'ed' or 'ing' are verbs". More sophisticated approaches to PoS tagging include probabilistic methods, which assign PoS tags based on the probability of a particular tag sequence occurring, and deep-learning methods.
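A toy tagger in the spirit of the rule-based approach described above might combine a small lexicon (which takes priority) with suffix rules. The lexicon entries and tag set here are illustrative; real taggers, probabilistic or neural, are far more accurate and handle the many exceptions such rules miss.

```python
# Hypothetical miniature lexicon; entries here override the suffix rules.
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "meeting": "NOUN"}

def tag(word: str) -> str:
    """Assign a coarse PoS tag from a lexicon lookup or suffix rules."""
    if word in LEXICON:
        return LEXICON[word]
    if word.endswith(("ed", "ing")):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"  # crude default fallback

sentence = "the board discussed earnings briefly".split()
print([(w, tag(w)) for w in sentence])
```

Note that "meeting" is pinned to NOUN in the lexicon precisely because the "ing" rule would otherwise mis-tag its noun sense, the same ambiguity lemmatisation has to resolve.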
Statistical analysis of text
How can you transform a document-term matrix into something more useful? This involves mapping it onto predicted meaningful attributes such as sentiment from company news, a macroeconomic variable, how frequently some event is observed or just the topics being discussed.
A number of approaches exist:
- Dictionary-based methods
A dictionary is created to associate words with attributes, which are then predicted using word counts. For example, if the attribute is company sentiment, a dictionary can associate each word with a positive, neutral or negative sentiment. Word counts then measure whether the text discusses a company more positively or negatively. However, first someone must build the dictionary. Dictionaries are readily available for sentiment, and a number also exist for textual analysis in finance and accounting, e.g. DICTION, General Inquirer and LIWC, as well as for many other problems. But a suitable dictionary may not always be available, and creating the right one, when required, can be very time consuming.
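In its simplest form, dictionary-based scoring is just a lookup and a sum. The six-word dictionary below is a made-up illustration; real work would use an established lexicon such as Loughran-McDonald for financial text.

```python
# Hypothetical miniature sentiment dictionary: +1 positive, -1 negative.
SENTIMENT = {"profit": 1, "growth": 1, "strong": 1,
             "loss": -1, "decline": -1, "weak": -1}

def sentiment_score(document: str) -> int:
    """Sum of word-level sentiment values; the sign gives the overall tone."""
    return sum(SENTIMENT.get(w, 0) for w in document.lower().split())

print(sentiment_score("strong growth despite a small loss"))
# → 1 (two positive words, one negative)
```

Words absent from the dictionary simply score zero, which is why coverage of the domain vocabulary matters so much for these methods.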
- Text regressions
Instead of relying on pre-defined dictionaries, text regression uses part of the available data to build a model, applying regression analysis (a statistical method that examines the relationship between two or more variables) to relate a target attribute to the corresponding word counts. So-called "penalised linear" models such as Ridge, Lasso or Elastic Net, which shrink coefficients and, in the case of Lasso and Elastic Net, limit the number of variables, are popular and easy to build.
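To see what the penalty does, here is a one-feature ridge regression fitted in closed form. The data and penalty strength are illustrative only; in practice one would use a library such as scikit-learn with thousands of word-count features and a penalty chosen by cross-validation.

```python
def ridge_slope(x: list[float], y: list[float], lam: float) -> float:
    """Closed-form ridge slope for one feature (no intercept):
    beta = sum(x*y) / (sum(x^2) + lam). With lam = 0 this is ordinary
    least squares; larger lam shrinks the slope towards zero."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

x = [1.0, 2.0, 3.0, 4.0]   # e.g. counts of one word per document
y = [2.0, 4.0, 6.0, 8.0]   # e.g. the attribute being predicted

print(ridge_slope(x, y, 0.0))   # → 2.0, the unpenalised fit
print(ridge_slope(x, y, 10.0))  # → 1.5, shrunk by the penalty
```

Shrinking coefficients this way trades a little bias for much lower variance, which is what keeps these models usable when there are far more words than documents.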
- Generative methods
Another alternative is to use generative models. These learn how attributes can influence word choice. Latent Dirichlet Allocation (LDA) is hugely popular for topic modelling.
LDA and similar models generate automatic summaries of the topics in the text by maximising the differences between the groups of words they create. The document-term matrix can be simplified by mapping words onto a smaller number of topics and counting these in the original document. LDA does not label topics but usually this can be deduced by looking at the most frequent words in each topic proposed.
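The mechanics can be sketched with a minimal collapsed Gibbs sampler for LDA on a made-up six-word corpus. The corpus, the number of topics K and the hyper-parameters alpha and beta are all illustrative choices; real work would use a library such as gensim, a far larger corpus and many more iterations.

```python
import random

random.seed(0)

docs = [["rates", "inflation", "rates", "bank"],
        ["goals", "match", "goals", "team"],
        ["bank", "inflation", "match", "team"]]
vocab = sorted({w for d in docs for w in d})
K, alpha, beta = 2, 0.1, 0.01  # topics and Dirichlet priors (illustrative)

# Random initial topic for every token, plus the three count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [[0] * K for _ in docs]              # topic counts per document
topic_word = [{w: 0 for w in vocab} for _ in range(K)]  # word counts per topic
topic_total = [0] * K                            # tokens per topic
for i, d in enumerate(docs):
    for j, w in enumerate(d):
        k = z[i][j]
        doc_topic[i][k] += 1
        topic_word[k][w] += 1
        topic_total[k] += 1

for _ in range(200):                             # Gibbs sweeps
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]                          # remove the token's topic
            doc_topic[i][k] -= 1
            topic_word[k][w] -= 1
            topic_total[k] -= 1
            # Resample proportional to P(topic | doc) * P(word | topic).
            weights = [(doc_topic[i][t] + alpha)
                       * (topic_word[t][w] + beta)
                       / (topic_total[t] + beta * len(vocab))
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[i][j] = k
            doc_topic[i][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

# As noted above, topics are unlabelled: inspect the top words to name them.
for t in range(K):
    print(f"topic {t}:", sorted(vocab, key=lambda w: -topic_word[t][w])[:3])
```

On a corpus like this one, the sampler tends to pull the finance words and the sport words into separate topics, because they co-occur within different documents.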
- Word-embedding approaches
Approaches based on word counts can perform extremely well in many applications. However, language structures are complex and reducing text to a simple word count inevitably means a significant amount of information is lost. Increasingly, deep-learning methods such as word embeddings are used. These reconstruct the linguistic context of words, capturing semantic and syntactic similarity as well as relationships to other words. In such methods, words are represented by vectors and text is treated as an ordered sequence of transitions between words. Words that share a common context have vectors located close to each other.
Word2vec uses continuous bag-of-words (CBOW) or continuous skip-gram. The first predicts the current word from surrounding context words, but ignores their word order. Continuous skip-gram uses the current word to predict surrounding context words, giving more weight to nearby words.
Accuracy can be improved by choosing between CBOW and skip-gram, by increasing the size of the training data set or the number of vector dimensions, or by adjusting the number of surrounding words considered, usually five for CBOW and ten for skip-gram.
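The role of the context window is easiest to see in the training pairs it produces. The sketch below generates the (target, context) pairs a skip-gram model would be trained on; the sentence and window size are illustrative.

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """(target, context) pairs: each word paired with every neighbour
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs("rates rise markets fall".split(), window=1))
# → [('rates', 'rise'), ('rise', 'rates'), ('rise', 'markets'),
#    ('markets', 'rise'), ('markets', 'fall'), ('fall', 'markets')]
```

A wider window yields more pairs per word and captures broader topical context, at the cost of diluting the purely syntactic signal from immediate neighbours.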
Word embedding can capture many degrees of similarity between words. Algebraic operations can make sense, e.g. "King" - "Man" + "Woman" produces the vector closest to the representation of "Queen". Such relationships can be generated for a range of semantic relations, e.g. "Country" – "Capital", as well as syntactic relations, e.g. present tense minus past tense.
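The analogy arithmetic can be demonstrated with hand-crafted two-dimensional toy vectors, one axis loosely standing for gender and one for royalty. The numbers are purely illustrative stand-ins for real trained embeddings, which have hundreds of dimensions.

```python
import math

# Toy 2-d vectors: (gender, royalty). Illustrative values only.
vec = {"king":  [1.0, 1.0], "queen": [-1.0, 1.0],
       "man":   [1.0, 0.0], "woman": [-1.0, 0.0]}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman should land nearest to queen.
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
nearest = max(vec, key=lambda w: cosine(vec[w], target))
print(nearest)
# → queen
```

Subtracting "man" removes the gender component of "king" while keeping royalty, and adding "woman" restores the opposite gender, which is exactly the geometry behind the famous analogy.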
Other, more recent word-embedding algorithms include GloVe, which, unlike Word2vec, focuses on word co-occurrences, with its embeddings relating to the probabilities that two words appear together, and fastText, which improves on Word2vec by also taking word parts into account, enabling embeddings to be trained on smaller datasets and to generalise to unknown words.
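The "word parts" fastText uses are character n-grams, with boundary markers so that prefixes and suffixes are distinguishable. A sketch of the decomposition (the default fastText range is 3- to 6-character n-grams; only n = 3 is shown here):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams of a word, with < and > marking its boundaries,
    in the style fastText uses for its subword embeddings."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is built from its subword vectors, a word never seen in training can still be assigned a sensible embedding from the n-grams it shares with known words.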
Advances in NLP are exciting for investors. Through the various approaches discussed in this article, NLP can turn vast amounts of text into insights, helping portfolio managers focus their attention on the assets with the highest expected returns or on those with rising risks to performance.
BNP Paribas Asset Management has long supported the use of quantitative approaches. We seek to combine such inputs with human intelligence in our investment philosophy. Indeed, we believe that when skilfully combined, human and artificial intelligence can lead to better investment decisions and improved risk management.
*To see some NLP applications to good effect in investing, read How Natural Language Processing boosts investment returns
Any views expressed here are those of the author as of the date of publication, are based on available information, and are subject to change without notice. Individual portfolio management teams may hold different views and may take different investment decisions for different clients.
The value of investments and the income they generate may go down as well as up and it is possible that investors will not recover their initial outlay. Past performance is no guarantee for future returns.
Investing in emerging markets, or in specialised or restricted sectors, is likely to be subject to higher-than-average volatility due to a high degree of concentration, greater uncertainty because less information is available, lower liquidity, or greater sensitivity to changes in market conditions (social, political and economic conditions).
Some emerging markets offer less security than the majority of international developed markets. For this reason, services for portfolio transactions, liquidation and conservation on behalf of funds invested in emerging markets may carry greater risk.