Last update: July 21, 2021
This page contains FinText, a purpose-built financial word embedding for financial textual analysis. These embeddings were trained on the Dow Jones Newswires Text News Feed from January 1, 2000, to September 14, 2015, which contains millions of news stories (2,733,035 unique tokens) covering finance, economics, politics, and other topics from news agencies worldwide. Extensive text preprocessing was applied to strip redundant characters, sentences, and structures from this large corpus. Four FinText models are available for download, covering the Word2Vec and FastText algorithms, each trained with both the CBOW and Skip-gram models. For a detailed review of the model specifications and their performance in realised volatility forecasting, see this paper. All models on this page are for non-commercial research purposes only.
The figure below shows a 2D visualisation of the word embeddings. For each embedding, Principal Component Analysis (PCA) is applied to the 300-dimensional vectors. The chosen tokens are 'microsoft', 'ibm', 'google', and 'adobe' (technology companies); 'barclays', 'citi', 'ubs', and 'hsbc' (financial services and investment banking companies); and 'tesco' and 'walmart' (retail companies). 'Dimension 1' (x-axis) and 'Dimension 2' (y-axis) show the first and second principal components. The Word2Vec and FastText algorithms are shown in the first and second rows. Google is a publicly available word embedding trained on part of the Google News dataset, and WikiNews is another publicly available word embedding trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset. Continuous Bag of Words (CBOW) and Skip-gram are the two training models for learning distributed representations of tokens. For the best word embedding, we expect tokens from the same company group to form clusters. The figure shows that FinText clusters all groups correctly.
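The projection step behind this figure can be sketched as follows. Random vectors stand in for the real FinText vectors so the snippet is self-contained; with a downloaded model, each row of `vectors` would be the 300-dimensional vector of the corresponding token.

```python
# PCA via SVD: project 300-dimensional token vectors onto their first
# two principal components ('Dimension 1' and 'Dimension 2').
# Random vectors are a placeholder for the real FinText embedding.
import numpy as np

rng = np.random.default_rng(0)
tokens = ['microsoft', 'ibm', 'google', 'adobe',
          'barclays', 'citi', 'ubs', 'hsbc',
          'tesco', 'walmart']
vectors = rng.standard_normal((len(tokens), 300))  # placeholder for wv[token]

centered = vectors - vectors.mean(axis=0)          # PCA centres the data first
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T                       # (10, 2) array of points to plot
```

Each row of `coords` gives the ('Dimension 1', 'Dimension 2') coordinates of one token in the scatter plot.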
Many word embeddings can solve word analogies such as man:king :: woman:queen (':' means 'is to' and '::' means 'as'). The table below lists some challenges we posed and the answers produced by each of the word embeddings considered here. 'NONE' indicates that one of the query tokens is not in the vocabulary. It is clear that FinText is more sensitive to financial context and able to capture very subtle financial relationships.
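Analogy queries of this kind are answered by vector arithmetic: the vector closest to v(king) - v(man) + v(woman) should be v(queen). A minimal sketch of this, using tiny hand-made 2-D vectors in place of the real 300-dimensional embedding:

```python
# Analogy solving by vector arithmetic over a toy vocabulary.
# The 2-D vectors are hand-crafted so the 'royal' and 'gender'
# directions are explicit; a real embedding learns them from text.
import numpy as np

vocab = {
    'king':  np.array([1.0, 1.0]),   # royal + male
    'queen': np.array([1.0, -1.0]),  # royal + female
    'man':   np.array([0.0, 1.0]),   # male
    'woman': np.array([0.0, -1.0]),  # female
}

def solve_analogy(a, b, c):
    """Return the token whose vector is closest (cosine) to v(b) - v(a) + v(c)."""
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -2.0
    for token, vec in vocab.items():
        if token in (a, b, c):       # query tokens are excluded, as in gensim
            continue
        sim = target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = token, sim
    return best

print(solve_analogy('man', 'king', 'woman'))  # → queen
```

With a real embedding loaded in gensim, the equivalent query is `wv.most_similar(positive=['king', 'woman'], negative=['man'])`.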
We also challenged all the word embeddings to produce the three tokens most similar to 'morningstar'. This token is not in Google's vocabulary. WikiNews's answers are 'daystar', 'blazingstar', and 'evenin'. FinText's answers are 'researcher_morningstar', 'tracker_morningstar', and 'lipper'. When asked to find the unmatched token in a group such as ['usdgbp', 'euraud', 'usdcad'], a collection of exchange-rate mnemonics, Google and WikiNews could not find these tokens, while FinText produces the correct answer, 'euraud'.
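The odd-one-out query is commonly implemented (as in gensim's `doesnt_match`) by returning the token whose vector is least cosine-similar to the mean of the group. A self-contained sketch, with toy 2-D vectors standing in for the real FinText vectors of the exchange-rate mnemonics:

```python
# Odd-one-out: the token least cosine-similar to the group's mean vector.
# Toy vectors are constructed so the two USD pairs point one way and the
# EUR pair another; the real embedding learns this geometry from text.
import numpy as np

def doesnt_match(vectors):
    """Return the index of the vector least cosine-similar to the group mean."""
    unit = np.array([v / np.linalg.norm(v) for v in vectors])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean
    return int(np.argmin(sims))

group = {'usdgbp': np.array([1.0, 0.1]),
         'usdcad': np.array([1.0, -0.1]),
         'euraud': np.array([-0.2, 1.0])}
names = list(group)
print(names[doesnt_match(list(group.values()))])  # → euraud
```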
Rahimikia, Eghbal, Zohren, Stefan, and Poon, Ser-Huang, 'Realised Volatility Forecasting: Machine Learning via Financial Word Embedding' (July 28, 2021). Available at SSRN: 3895272.
This FinText word embedding is developed based on the Word2Vec algorithm and the CBOW model.
This FinText word embedding is developed based on the Word2Vec algorithm and the Skip-gram model.