DOI: 10.64336/001c.138087 ISSN: 2575-6206

Utilizing Large Language Models for Text-Based Industry Classification

James Offutt

This study develops a novel, dynamic industry classification system rooted in Artificial Intelligence (AI), using Large Language Model (LLM) technology to analyze and compare firms’ product descriptions found in Securities and Exchange Commission (SEC) 10-Q and 10-K filings. Unlike traditional static classification systems such as the Standard Industrial Classification (SIC) or the North American Industry Classification System (NAICS), the proposed method dynamically quantifies the degree of competition and the customer-supplier relationships between firms. It compiles the relationship scores into a 210x210 similarity matrix that serves as a starting point for further analysis. This enhanced metric strengthens the literature and aids in identifying portfolio correlations, providing nuanced firm-to-firm insights that other methodologies have not fully captured. In turn, it assists investors in risk management and provides insights into behavioral finance by highlighting how news perception affects market dynamics. It also has potential implications for merger and acquisition strategy, supply chain analysis, and policymaking. The methodology employs Ordinary Least Squares (OLS) regression and pairwise correlation analysis to evaluate the efficacy of the LLM measurement against the SIC and NAICS codes. The LLM measure outperformed the other methods across most models. In the few cases where it did not, the models had low observation counts, lower R² values, and weak F-statistics. These findings indicate that using LLMs and AI as an industry classification tool is plausible and, for the majority of industry code granularities, superior to the customary measures for identifying competitors and especially customer-supplier relationships.
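As a rough illustration of the evaluation pipeline described above, the following sketch builds a 210x210 pairwise similarity matrix and regresses a pairwise outcome on it with OLS. Everything beyond the matrix size and the use of OLS is an assumption made for illustration: the abstract does not say how the LLM produces its scores, so cosine similarity over hypothetical embedding vectors stands in for them, and pairwise return correlation is assumed as the dependent variable. All data are synthetic and the function names are hypothetical.

```python
import numpy as np


def similarity_matrix(vectors):
    """Cosine similarity between every pair of row vectors (N x N, symmetric)."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T


def upper_pairs(matrix):
    """Flatten the strict upper triangle so each firm pair appears once."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]


def ols_r2(x, y):
    """R^2 from an OLS fit of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


# Synthetic stand-ins (assumptions, not the paper's data).
rng = np.random.default_rng(0)
n_firms = 210
embeddings = rng.normal(size=(n_firms, 16))  # hypothetical text embeddings
returns = rng.normal(size=(250, n_firms))    # hypothetical daily returns

sim = similarity_matrix(embeddings)            # 210 x 210 relationship scores
ret_corr = np.corrcoef(returns, rowvar=False)  # 210 x 210 return correlations

# One observation per unordered firm pair: 210 * 209 / 2 pairs.
r2 = ols_r2(upper_pairs(sim), upper_pairs(ret_corr))
print(f"pairs: {upper_pairs(sim).size}, R^2: {r2:.4f}")
```

Under the same assumptions, a benchmark comparison would replace `upper_pairs(sim)` with a same-code indicator (for example, 1 when two firms share an SIC or NAICS code at a given granularity) and compare the resulting R² values.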