Climate risk is, among other things, a financial risk ([4]); it is, however, very difficult to quantify, and quantifying it requires alternative data and machine learning. There are, for example, no standardized environment-related disclosures across the companies active in the market: some companies account for climate risk in their business model and risk management and report past incurred losses, future strategies, possible opportunities, etc., while others remain silent on the topic or only mention possible future plans.
This information is scattered across a plethora of documents: annual and financial statements, management letters, etc. The Bloomberg terminal offers relevant information on this topic through its climate-related analysis; in particular, many companies are scored in the field "ENVIRON_DISCLOSURE_SCORE" from zero to one hundred (lowest to highest transparency). A starting point is the preparation of a corpus of documents from Bloomberg for the scored companies: for each one, at least the latest financial statement and annual report have to be downloaded. The Bloomberg function DSCO <GO> shows the different documents that can be used (we assume an API can be used; in case of issues let us know). Once the corpus is prepared, it has to be loaded as usual via tokens and bag-of-words/TF-IDF representations in Python (e.g. [2-3]).

A first step consists in an unsupervised analysis such as SVD (or similar algorithms) for the purpose of latent semantic analysis. A second step is the application of word2vec (or sent2vec) to the documents, which again gives a vector representation of the content but with the additional advantage of capturing "semantic" similarity. An important result in itself is the descriptive analysis of the distribution of words/scores per topic for the companies with the highest disclosure scores versus those with lower ones. In a third step, supervised learning algorithms can be trained to reproduce Bloomberg's environment score on the calibration sample; different types of algorithms can be tested (random forests/boosted decision trees, neural networks, etc.), together with the relative performance analysis. A sketch of these steps is given below.

The vector embeddings (e.g. word2vec) are needed, moreover, to identify groups of specific key sentences, words, and n-grams that are particularly meaningful in the text analysis of climate-related disclosures: even among the highest-scored companies there are different semantic contents that we want to learn to distinguish. The simplest possibility is the K-means algorithm (along the dimensions most relevant for "climate" topics). However, it is well known that this clustering strongly depends on the notion of distance used. To improve on/challenge this, a list (subsample) of "similar and dissimilar" companies, based on experts' domain knowledge, will be provided, and the GMML algorithm of [1] can then be run on this additional coarse information ("similar vs. dissimilar") to train a metric on the specific data sample (see the second sketch below). GMML is not the only metric-learning algorithm, but it has two interesting advantages: a) it is an example of a solvable non-convex optimization problem; b) it has good computation-time performance. The presentation of the results for the two algorithms, standard K-means vs. K-means with the GMML metric, constitutes the successful completion of this part of the project. Clearly, this is not a full schedule, and possible alternative/creative solutions can be discussed around the same topics.
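By way of illustration, a minimal sketch of the first three steps in Python (scikit-learn and gensim) could look as follows. Everything here is an assumption for the sketch rather than part of the project specification: the names `docs` and `scores`, the number of latent dimensions, the embedding settings, and the median cut that turns the 0-100 score into a high/low classification target.

```python
# Minimal sketch of steps one to three, assuming the Bloomberg corpus has
# already been downloaded. All names and hyper-parameters are illustrative.
from typing import List, Sequence

import numpy as np
from gensim.models import Word2Vec  # gensim >= 4 API assumed
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def disclosure_pipeline(docs: List[str], scores: Sequence[float], n_topics: int = 100):
    """docs: one string per company (its concatenated filings);
    scores: the company's ENVIRON_DISCLOSURE_SCORE (0-100)."""
    # Bag-of-words / TF-IDF representation of the corpus.
    tfidf = TfidfVectorizer(stop_words="english", max_features=20000)
    X = tfidf.fit_transform(docs)

    # Step 1: latent semantic analysis via truncated SVD of the TF-IDF matrix.
    lsa = TruncatedSVD(n_components=n_topics, random_state=0)
    X_lsa = lsa.fit_transform(X)

    # Step 2: word2vec embeddings; a document vector is taken here as the
    # mean of its word vectors (one simple choice among many).
    tokenized = [d.lower().split() for d in docs]
    w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2)
    doc_vecs = np.array([
        np.mean([w2v.wv[w] for w in toks if w in w2v.wv]
                or [np.zeros(w2v.vector_size)], axis=0)
        for toks in tokenized
    ])

    # Step 3: supervised learning to reproduce Bloomberg's score, cast here
    # as a high- vs. low-disclosure classification at the median score.
    y = (np.asarray(scores) >= np.median(scores)).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X_lsa, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))

    # Baseline clustering with standard (Euclidean) K-means.
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_lsa)
    return X_lsa, doc_vecs, clf, labels
```

Regressing the raw 0-100 score instead of classifying it, or swapping the random forest for boosted trees or a neural network, fits the same skeleton.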
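For the metric-learning step, GMML [1] admits a closed-form solution: with scatter matrices S and D built from the similar and dissimilar pairs, the learned Mahalanobis matrix is the geodesic point A = S^{-1} #_t D on the manifold of positive-definite matrices. A minimal sketch follows, assuming the expert-provided pairs arrive as index tuples into the embedding matrix; the regularization, t = 0.5, and k = 5 are illustrative choices, not prescriptions.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, inv, sqrtm
from sklearn.cluster import KMeans


def gmml_metric(X, similar, dissimilar, t=0.5, reg=1e-6):
    """Closed-form GMML [1]: A = S^{-1} #_t D, the geodesic between the
    inverse similar-pair scatter S^{-1} and the dissimilar-pair scatter D."""
    d = X.shape[1]
    S = reg * np.eye(d)  # small ridge keeps S and D positive definite
    D = reg * np.eye(d)
    for i, j in similar:
        v = (X[i] - X[j]).reshape(-1, 1)
        S += v @ v.T
    for i, j in dissimilar:
        v = (X[i] - X[j]).reshape(-1, 1)
        D += v @ v.T
    S_half = np.real(sqrtm(S))
    S_half_inv = inv(S_half)
    # Geodesic A #_t B at A = S^{-1}, B = D:
    #   S^{-1/2} (S^{1/2} D S^{1/2})^t S^{-1/2}
    inner = np.real(fractional_matrix_power(S_half @ D @ S_half, t))
    return S_half_inv @ inner @ S_half_inv


def kmeans_with_metric(X, M, k=5):
    """K-means under the Mahalanobis metric M is equivalent to Euclidean
    K-means on the linearly transformed points X @ M^{1/2}."""
    X_metric = X @ np.real(sqrtm(M))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_metric)
```

Comparing the labels from `kmeans_with_metric` against the plain K-means labels from the first sketch is then the side-by-side presentation, standard K-means vs. K-means with GMML, described above.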
Direct Supervisor: Ali Hirsa
Position Dates: 6/1/2020 - 8/31/2020
Hours per Week: 20-40
Eligibility: SEAS only
Ali Hirsa, [email protected]