The official blog of BNP Paribas Asset Management

How Natural Language Processing can boost investment returns

Three papers at Inquire Europe’s 2019 investment seminar looked at the results of turning natural language into analysable data for asset managers

At Inquire Europe’s 2019 autumn seminar in Krakow, Poland, some 100 investment professionals and academics discussed “Investing with Machine Learning and New Techniques”. They examined how Natural Language Processing (NLP) can help us digest vast amounts of text simply and effectively by turning language into data.

We looked at how NLP works in our previous post. Here we report on its practical applications, discussed in presentations by Eugenio Carnemolla, from the Swiss Finance Institute at the University of Lausanne, by Alejandro Lopez-Lira, from The Wharton School of the University of Pennsylvania, and by Lin William Cong, from the University of Chicago Booth School of Business. 

What the papers revealed

Natural Language Processing applied to climate risks and stock returns

Eugenio Carnemolla and Giuseppe Vinci looked at the impact of climate risks, which can affect profitability and cash flow, on stock returns.

Focusing on US companies, they identified climate-sensitive firms using CRisk, a dictionary-based measure of risk estimated from the companies’ Form 10-K filings, retrieved from EDGAR, the US Securities and Exchange Commission’s public database.

Using a bag-of-words model that ignores grammatical structure and word order, they first compiled a comprehensive hazard events dictionary from the SHELDUS database, a natural hazard data set.

The hazard dictionary contained the search terms drought*, earthquake*, eruption*, wildfire*, flooding*, hailstorm*, hurrican*, landslide*, rainfal*, winter storm*, avalanche*, storm*, typhoon*, tornado, tropical storm, tsunami* and wind. Then they created a damages and losses dictionary with words indicating that a firm’s production had been disrupted or had sustained damage or loss through natural disasters. Its search terms were damag*, destroy*, destruct*, disrupt*, harm*, interrupt* and loss*.

Words from the second dictionary had to occur within 10 words of any term from the hazard events list to avoid generic disclaimers about sensitivity to adverse weather events.
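The proximity rule described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the authors' code: we treat a trailing asterisk as "any word starting with this stem", match terms without an asterisk exactly, and count each damage word that has a hazard word within 10 tokens. The function names and the sample sentence are hypothetical.

```python
import re

# Stems from the two dictionaries listed in the article. A trailing "*" in
# the article is treated here as a prefix match (an assumption).
HAZARD_STEMS = ("drought", "earthquake", "eruption", "wildfire", "flooding",
                "hailstorm", "hurrican", "landslide", "rainfal", "avalanche",
                "storm", "typhoon", "tsunami")
# Terms listed without "*" are matched exactly. The two-word storm terms
# ("winter storm", "tropical storm") are already caught by the "storm" stem.
HAZARD_EXACT = {"tornado", "wind"}
DAMAGE_STEMS = ("damag", "destroy", "destruct", "disrupt", "harm",
                "interrupt", "loss")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_hazard(token):
    return token in HAZARD_EXACT or token.startswith(HAZARD_STEMS)


def is_damage(token):
    return token.startswith(DAMAGE_STEMS)


def count_proximity_hits(text, window=10):
    """Count damage words with at least one hazard word within `window` tokens."""
    tokens = tokenize(text)
    hazard_positions = [i for i, tok in enumerate(tokens) if is_hazard(tok)]
    hits = 0
    for i, tok in enumerate(tokens):
        if is_damage(tok) and any(abs(i - j) <= window for j in hazard_positions):
            hits += 1
    return hits


sample = "The hurricane damaged our main plant and disrupted production for weeks."
print(count_proximity_hits(sample))
```

A generic disclaimer such as "our results may be harmed by adverse weather" scores zero here because no hazard term sits near the damage word, which is the filtering effect the proximity rule is meant to achieve.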

Identifying climate-sensitive firms; source: Climate Risks and Stock Returns; Eugenio Carnemolla and Giuseppe Vinci; August 2019

Supervised machine learning estimated the relationship between word counts and sensitivity to climate risk. To validate this text-based measure, the authors used each company’s geographical footprint: a geographic dictionary, based on state, county and city, identified company locations.

Using data from 1994 to 2017, they found that a monthly rebalanced long-short factor strategy, buying climate-resilient firms and selling short climate-sensitive firms, would have earned a positive and significant annual alpha of +3.72%, robust to the Fama-French five-factor model.
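To make the idea of an alpha that is robust to a factor model more concrete, here is a minimal sketch using NumPy. It regresses a long-short return series on five factor return series and reads the alpha off the intercept. Everything in it is an assumption for illustration: the data are synthetic random numbers (not the paper's data or actual Fama-French factors), the function names are hypothetical, and no significance test is performed.

```python
import numpy as np


def long_short_returns(resilient, sensitive):
    """Monthly return of being long resilient firms and short sensitive firms."""
    return np.asarray(resilient, dtype=float) - np.asarray(sensitive, dtype=float)


def estimate_alpha(strategy_returns, factor_returns):
    """OLS of strategy returns on factor returns.

    Returns the intercept (monthly alpha) and the factor loadings.
    strategy_returns has shape (T,), factor_returns has shape (T, K).
    """
    y = np.asarray(strategy_returns, dtype=float)
    X = np.asarray(factor_returns, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # constant + factors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


# Synthetic illustration only: 120 months, 5 made-up factor series.
rng = np.random.default_rng(0)
factors = rng.normal(0.0, 0.02, size=(120, 5))
assumed_betas = np.array([0.1, -0.2, 0.05, 0.0, 0.3])
sensitive_leg = rng.normal(0.01, 0.05, size=120)
# Build the resilient leg so the spread has a known monthly alpha of 0.3%.
resilient_leg = sensitive_leg + 0.003 + factors @ assumed_betas

spread = long_short_returns(resilient_leg, sensitive_leg)
monthly_alpha, loadings = estimate_alpha(spread, factors)
print("monthly alpha:", monthly_alpha)
print("simple annualised alpha:", 12 * monthly_alpha)
```

Because the synthetic spread contains no noise, the regression recovers the alpha we built in. With real returns the intercept would come with a standard error, and that is what a claim of significance refers to.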

Company risk disclosures and Natural Language Processing

Alejandro Lopez-Lira’s paper, “Risk Factors that Matter: Textual Analysis of Risk Disclosures for the Cross-Section of Returns”, looked at whether risk factors explain differences in companies’ returns better than existing models, such as the Fama-French five-factor model, the q-model and the Stambaugh model.

He focused on US companies, using the risks that companies declared in their annual reports on Form 10-K between 2005 and 2018. Text was simplified using lemmatisation. Then he constructed a document-term matrix and used Latent Dirichlet Allocation (LDA) to generate 25 topics. The four highest-reported risk topics for 2006 were retained for the entire sample to avoid data mining and look-ahead bias.
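As an illustration of the pre-processing steps, the sketch below builds a small document-term matrix in plain Python. It is a simplified stand-in and not the paper's pipeline: the lemma table is a tiny hand-written, hypothetical lookup rather than a real lemmatiser, and the sample sentences are invented. A matrix of this kind is what would then be passed to an LDA implementation (for example, scikit-learn's LatentDirichletAllocation) to estimate topics. That step is left out here.

```python
import re
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only. A real
# pipeline would use a proper lemmatiser.
LEMMAS = {
    "risks": "risk",
    "suppliers": "supplier",
    "costs": "cost",
    "markets": "market",
    "customers": "customer",
}


def lemmatise(text):
    """Lower-case, tokenise and map each token to its lemma where known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one
    column per vocabulary term, holding raw term counts."""
    counts = [Counter(lemmatise(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = [
    "Supplier costs and supplier risks",
    "Foreign markets risk",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Lemmatisation matters because it lets "risk" and "risks" fall into the same column, which keeps the vocabulary smaller and the counts less sparse before topic modelling.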

Risk topics; source: Textual Analysis of Risk Disclosures for the Cross-Section of Returns; Alejandro Lopez-Lira; The Wharton School, University of Pennsylvania; Oct 2019

Using the most frequent words in each topic, Lopez-Lira identified Production Risk, International Risk, Technology Risk and Demand Risk as the four most significant. On average companies spent 36% of their risk disclosures discussing these risks and 64% on the others.

He calculated annual rebalanced strategies that invested in a market-capitalisation weighted portfolio of companies that spent more than 25% of their risk disclosures on these topics.
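The portfolio rule can be sketched as follows. This is our reading of the description rather than the paper's code: we assume one portfolio per risk topic, select firms whose share of risk disclosure on that topic exceeds 25%, and weight them by market capitalisation. The firm names and figures are invented.

```python
def topic_portfolio_weights(firms, threshold=0.25):
    """Market-cap weights for firms whose topic share exceeds `threshold`.

    `firms` maps a firm name to a (market_cap, topic_share) pair, where
    topic_share is the fraction of the firm's risk disclosure devoted to
    the topic in question.
    """
    selected = {
        name: cap
        for name, (cap, share) in firms.items()
        if share > threshold
    }
    total_cap = sum(selected.values())
    if total_cap == 0:
        return {}  # no firm qualifies
    return {name: cap / total_cap for name, cap in selected.items()}


# Invented example: market caps in arbitrary units, topic shares as fractions.
firms = {
    "Firm A": (300.0, 0.40),
    "Firm B": (100.0, 0.30),
    "Firm C": (600.0, 0.10),  # below the 25% threshold, so excluded
}
print(topic_portfolio_weights(firms))
```

With annual rebalancing, the selection and the weights would be recomputed once a year from the latest disclosures and market capitalisations.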

To test these factors as an asset pricing model, he applied the four risk factors to 49 industry portfolios, 25 book-to-market portfolios and 11 anomaly portfolios and asked how well his model explained the returns and risk of such portfolios compared with the other models. His model performed better (with lower pricing errors) for the 49 industry portfolios, and slightly better for the 25 book-to-market portfolios.

Advanced textual factors and finance applications

Lin William Cong’s paper, “Textual Factors: A Scalable, Interpretable, and Data-driven Approach to Analyzing Unstructured Information”, demonstrated the use of more sophisticated word-embedding approaches.

First, he and co-authors Tengyuan Liang and Xiao Zhang used Word2Vec to capture the semantic and syntactic links between words in texts. Second, they used a search method known as Locality-Sensitive Hashing (LSH) to identify clusters of words based on their vector embeddings. Finally, they built a topic model (using LDA) to enhance interpretability by examining frequency distributions over textual factors and over the supporting words within each factor. Indeed, generating topics from clusters of semantically close word vectors is in line with the way humans think about topics.
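The clustering step can be illustrated with one common form of locality-sensitive hashing, based on random hyperplanes. We do not know which LSH variant the authors used, so treat this as a generic sketch: each word vector is hashed to a bit signature according to which side of each random hyperplane it falls on, and words sharing a signature land in the same bucket. The three-dimensional "embeddings" below are invented toy values, not Word2Vec output, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw `n_planes` random normal vectors of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0)
        for plane in planes
    )


def lsh_buckets(embeddings, planes):
    """Group words whose vectors share the same signature."""
    buckets = defaultdict(list)
    for word, vec in embeddings.items():
        buckets[signature(vec, planes)].append(word)
    return dict(buckets)


# Toy vectors, invented for illustration.
embeddings = {
    "inflation": [0.9, 0.1, 0.0],
    "prices": [0.8, 0.2, 0.1],
    "board": [0.0, 0.1, 0.9],
    "directors": [0.1, 0.0, 0.8],
}
planes = make_hyperplanes(n_planes=4, dim=3)
for sig, words in lsh_buckets(embeddings, planes).items():
    print(sig, words)
```

Vectors pointing in similar directions are likely, though not guaranteed, to share a signature. Using more hyperplanes makes buckets smaller and purer, and fewer hyperplanes makes them larger and coarser.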

They discussed how this framework could apply in finance and economics to (i) making predictions and inference, (ii) interpreting non-text-based models and variables, and (iii) constructing new text-based metrics and explanatory variables. Their illustrations used topics such as macroeconomic forecasting, factor asset pricing and corporate governance.


Natural Language Processing can be used to good effect in investment management, whether trying to analyse company disclosures or the impact of climate risk on stock returns.

BNP Paribas Asset Management constantly monitors developments in artificial intelligence such as NLP so that we can ensure the competitiveness of both our fundamental and quantitative (factor*) strategies.

To learn more about how NLP works, read What is Natural Language Processing and how can it help us?

2019 Inquire Europe seminar: Deep learning framework for asset pricing models

2019 Inquire Europe seminar: Machine learning and new techniques

*To learn more, download our Practical Guide to Multifactor Investing.

Any views expressed here are those of the author as of the date of publication, are based on available information, and are subject to change without notice. Individual portfolio management teams may hold different views and may take different investment decisions for different clients.

The value of investments and the income they generate may go down as well as up and it is possible that investors will not recover their initial outlay. Past performance is no guarantee for future returns.

Investing in emerging markets, or in specialised or restricted sectors, is likely to be subject to higher-than-average volatility due to a high degree of concentration, greater uncertainty because less information is available, lower liquidity, or greater sensitivity to changes in market conditions (social, political and economic conditions).

Some emerging markets offer less security than the majority of international developed markets. For this reason, services for portfolio transactions, liquidation and conservation on behalf of funds invested in emerging markets may carry greater risk.
