The prominence of textual data in accounting research has increased dramatically. To assist researchers in understanding and using textual data, this monograph defines and describes common measures of textual data and then demonstrates the collection and processing of textual data using the Python programming language. The monograph is replete with sample code that replicates textual analysis tasks from recent research papers.
In the first part of the monograph, we provide guidance on getting started in Python. We first describe Anaconda, a distribution of Python that provides the requisite libraries for textual analysis, and its installation. We then introduce the Jupyter notebook, a programming environment that improves research workflows and promotes replicable research. Next, we teach the basics of Python programming and demonstrate the basics of working with tabular data in the Pandas package.
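The Pandas basics described above can be sketched briefly. The filing metadata below is hypothetical, invented purely to illustrate how tabular data is built, filtered, and summarized in Pandas:

```python
import pandas as pd

# Build a small DataFrame of hypothetical filing metadata, as one might
# when organizing a corpus of documents for textual analysis.
filings = pd.DataFrame({
    "cik": [320193, 789019],        # illustrative company identifiers
    "form": ["10-K", "10-K"],
    "word_count": [52000, 61000],
})

# Filter rows and compute a summary statistic with standard Pandas operations.
long_filings = filings[filings["word_count"] > 55000]
mean_length = filings["word_count"].mean()
```

Operations like these, chained over columns of text-derived measures, are the typical end point of converting documents into structured data.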
The second part of the monograph focuses on specific textual analysis methods and techniques commonly used in accounting research. We first introduce regular expressions, a concise pattern-matching language for finding patterns in text. We then show how to use regular expressions to extract specific passages from text. Next, we introduce the idea of transforming textual data (unstructured data) into numerical measures representing variables of interest (structured data). Specifically, we introduce dictionary-based methods of (1) measuring document sentiment, (2) computing text complexity, (3) identifying forward-looking sentences and risk disclosures, (4) extracting informative numbers from text, and (5) computing the similarity of different pieces of text. For each of these tasks, we cite relevant papers and provide code snippets that implement the relevant metrics from these papers.
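A minimal sketch of the dictionary-based approach combines a regular expression for tokenizing with word-list lookups. The word lists and the length normalization below are illustrative assumptions for exposition, not the measures from any particular paper (actual studies typically use much larger dictionaries, such as Loughran and McDonald's):

```python
import re

# Tiny illustrative word lists; stand-ins for real sentiment dictionaries.
NEGATIVE = {"loss", "decline", "impairment"}
POSITIVE = {"growth", "improvement", "gain"}

def sentiment_score(text):
    # Tokenize with a simple regular expression: runs of letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    neg = sum(t in NEGATIVE for t in tokens)
    pos = sum(t in POSITIVE for t in tokens)
    # Net sentiment scaled by document length, a common normalization choice.
    return (pos - neg) / len(tokens) if tokens else 0.0

example = "Revenue growth offset the impairment loss recorded this quarter."
score = sentiment_score(example)  # negative: two negative words, one positive
```

The same template (tokenize, count dictionary hits, normalize) underlies the other dictionary-based measures, with the word lists or patterns swapped out.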
Finally, the third part of the monograph focuses on automating the collection of textual data. We introduce web scraping and provide code for downloading filings from EDGAR.
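As a minimal sketch of that step, the helper below constructs the URL of a filing's directory in EDGAR's archive layout and fetches it with Python's standard library. The CIK and accession number in the test values are illustrative, and the SEC asks automated requests to carry a descriptive User-Agent header:

```python
import urllib.request

def filing_index_url(cik, accession):
    # EDGAR stores each filing under
    # /Archives/edgar/data/<cik>/<accession-number-without-dashes>/
    acc = accession.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc}/"

def fetch(url, user_agent="Research Contact research@example.edu"):
    # The User-Agent value here is a placeholder; use your own contact details.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Example (network call not executed here):
# html = fetch(filing_index_url(320193, "0000320193-22-000108"))
```

Looping such requests over an index of accession numbers is the essence of bulk EDGAR collection; polite scraping also throttles request rates per the SEC's guidelines.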
Using Python for Text Analysis in Accounting Research provides faculty and PhD students in the social sciences with an interactive, step-by-step framework for analyzing spoken or written language. The goal is to demonstrate how textual analysis can enhance research by automatically extracting new information from voluminous disclosures, news articles, and social media posts. Materials are presented in a way that allows readers to learn about a textual analysis concept or technique and then replicate it by doing.
The monograph begins by showing how to install and use Python, a popular general-purpose programming language, and reviews Python's basic syntax, operators, data types, and functions so that readers can first familiarize themselves with the programming environment. It then discusses the Jupyter notebook, an open-source web application for creating, running, and testing Python code interactively, and introduces the Pandas package for working with tabular data, which aids researchers as they convert unstructured textual data into structured, tabular form. The authors next introduce regular expressions, which describe patterns for matching different elements in text, and then proceed to discuss and implement various textual analysis methods used in accounting and finance studies. Finally, the monograph provides an overview of web scraping and file processing in Python, with a focus on downloading EDGAR filings and identifying specific sections within them.
Taken together, the first five chapters of this monograph will help readers get started with Python and prepare them to write their own code.