Foundations and Trends® in Databases > Vol 1 > Issue 3

Information Extraction

By Sunita Sarawagi, Indian Institute of Technology, CSE, India, sunita@iitb.ac.in

 
Suggested Citation
Sunita Sarawagi (2008), "Information Extraction", Foundations and Trends® in Databases: Vol. 1: No. 3, pp 261-377. http://dx.doi.org/10.1561/1900000003

Publication Date: 30 Nov 2008
© 2008 S. Sarawagi
 
Subjects
Data Cleaning and Information Extraction,  Information extraction
 

In this article:
1. Introduction 
2. Entity Extraction: Rule-based Methods 
3. Entity Extraction: Statistical Methods 
4. Relationship Extraction 
5. Management of Information Extraction Systems 
6. Concluding Remarks 
Acknowledgments 
References 

Abstract

The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community, where the primary impetus came from competitions centered around the recognition of named entities, such as the names of people and organizations, in news articles. As society became more data-oriented, with easy online access to both structured and unstructured data, new applications of structure extraction emerged. Now, there is interest in converting our personal desktops to structured databases, turning the knowledge in scientific publications into structured records, and harnessing the Internet for structured fact-finding queries. Consequently, many different communities of researchers are bringing in techniques from machine learning, databases, information retrieval, and computational linguistics for various aspects of the information extraction problem.

This review surveys over two decades of information extraction research from these diverse communities. We create a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. We elaborate on rule-based and statistical methods for entity and relationship extraction. In each case we highlight the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. We survey techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities, and handling uncertainty in the extraction process.

DOI:10.1561/1900000003
ISBN: 978-1-60198-188-2 (paperback), 124 pp., $85.00
ISBN: 978-1-60198-189-9 (e-book, PDF), 124 pp., $100.00

Information Extraction

Information Extraction deals with the automatic extraction of information from unstructured sources. This field has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The text surveys over two decades of information extraction research from various communities such as computational linguistics, machine learning, databases and information retrieval.

Information Extraction provides a taxonomy of the field along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced. It elaborates on rule-based and statistical methods for entity and relationship extraction. In each case it highlights the different kinds of models for capturing the diversity of clues driving the recognition process and the algorithms for training and efficiently deploying the models. It surveys techniques for optimizing the various steps in an information extraction pipeline, adapting to dynamic data, integrating with existing entities and handling uncertainty in the extraction process.

Information Extraction is an ideal reference for anyone with an interest in the fundamental concepts of this technology. It is also an invaluable resource for those researching, designing or deploying models for extraction.

 
DBS-003