Foundations and Trends® in Information Retrieval > Vol 19 > Issue 5

From Foundations to GPT in Text Classification: A Comprehensive Survey on Current Approaches and Future Trends

By Marco Siino, University of Catania, Italy and University of Palermo, Italy, marco.siino@unict.it | Ilenia Tinnirello, University of Palermo, Italy, ilenia.tinnirello@unipa.it | Marco La Cascia, University of Palermo, Italy, marco.lacascia@unipa.it

 
Suggested Citation
Marco Siino, Ilenia Tinnirello and Marco La Cascia (2025), "From Foundations to GPT in Text Classification: A Comprehensive Survey on Current Approaches and Future Trends", Foundations and Trends® in Information Retrieval: Vol. 19: No. 5, pp 557-711. http://dx.doi.org/10.1561/1500000107

Publication Date: 17 Apr 2025
© 2025 M. Siino et al.
 
Subjects
Natural language processing for IR, Text mining, Information categorization and clustering, Topic detection and tracking, Applications of IR, Formal models and language models for IR, Information extraction, Classification and prediction, Data mining, Data cleaning and information extraction
 

In this article:
1. Introduction
2. Tasks and Datasets
3. Preprocessing
4. Representation
5. Classification
6. Evaluation
7. Conclusion
Acknowledgements
References

Abstract

Text classification stands as a cornerstone within the realm of Natural Language Processing (NLP), particularly when viewed through computer science and engineering. The past decade has seen deep learning revolutionize text classification, propelling advancements in text retrieval, categorization, information extraction, and summarization. The scholarly literature includes datasets, models, and evaluation criteria, with English being the predominant language of focus, despite studies involving Arabic, Chinese, Hindi, and others. The efficacy of text classification models relies heavily on their ability to capture intricate textual relationships and non-linear correlations, necessitating a comprehensive examination of the entire text classification pipeline.

In the NLP domain, a plethora of text representation techniques and model architectures have emerged, with Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) at the forefront. These models are adept at transforming extensive textual data into meaningful vector representations encapsulating semantic information. The multidisciplinary nature of text classification, encompassing data mining, linguistics, and information retrieval, highlights the importance of collaborative research to advance the field. This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification.

This monograph provides an in-depth exploration of the text classification pipeline, with a particular emphasis on evaluating the impact of each component on the overall performance of text classification models. The pipeline includes state-of-the-art datasets, text preprocessing techniques, text representation methods, classification models, evaluation metrics, and future trends. Each section examines these stages, presenting technical innovations and recent findings. The work assesses various classification strategies, offering comparative analyses, examples and case studies. These contributions extend beyond a typical survey, providing a detailed and insightful exploration of the field.

DOI: 10.1561/1500000107
Paperback: ISBN 978-1-63828-558-8, 166 pp., $99.00
E-book (.pdf): ISBN 978-1-63828-559-5, 166 pp., $160.00

From Foundations to GPT in Text Classification: A Comprehensive Survey on Current Approaches and Future Trends

Text classification is a crucial task in many Natural Language Processing (NLP) applications, such as news categorization, sentiment analysis, and subject labelling. The goal is to assign tags or labels to textual units such as sentences, questions, paragraphs, and documents. In this era of massive information dissemination, manually processing and categorizing large amounts of text takes considerable time and effort. Text classification stands as a cornerstone of NLP, particularly when viewed through the lens of computer science and engineering. The past decade has seen deep learning revolutionize text classification, propelling advances in text retrieval, categorization, information extraction, and summarization. The efficacy of text classification models relies heavily on their ability to capture intricate textual relationships and non-linear correlations, which calls for a comprehensive examination of the entire text classification pipeline.

This work integrates traditional and contemporary text mining methodologies, fostering a holistic understanding of text classification. In the NLP domain, numerous text representation techniques and model architectures have emerged, with Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) at the forefront. These models are adept at transforming extensive textual data into meaningful vector representations that encapsulate semantic information. Text classification is multidisciplinary in nature, encompassing data mining, linguistics, and information retrieval.

 
INR-107