Foundations and Trends® in Computer Graphics and Vision > Vol 2 > Issue 4

A Stochastic Grammar of Images

By Song-Chun Zhu, University of California, USA, sczhu@stat.ucla.edu | David Mumford, Brown University, USA, David Mumford@brown.edu

 
Suggested Citation
Song-Chun Zhu and David Mumford (2007), "A Stochastic Grammar of Images", Foundations and Trends® in Computer Graphics and Vision: Vol. 2: No. 4, pp 259-362. http://dx.doi.org/10.1561/0600000018

Publication Date: 20 Aug 2007
© 2007 S.-C. Zhu and D. Mumford
 
Subjects
Learning and statistical methods,  Statistical/machine learning,  Image and video processing,  Bayesian learning,  Graphical models
 

In this article:
1. Introduction 
2. Background 
3. Visual Vocabulary 
4. Relations and Configurations 
5. Parse Graph for Objects and Scenes 
6. Knowledge Representation with And–Or Graph 
7. Learning and Estimation with And–Or Graph 
8. Recursive Top-Down/Bottom-Up Algorithm for Image Parsing 
9. Three Case Studies of Image Grammar 
10. Summary and Discussion 
Acknowledgments 
References 

Abstract

This exploratory paper quests for a stochastic and context-sensitive grammar of images. The grammar should achieve the following four objectives and thus serve as a unified framework of representation, learning, and recognition for a large number of object categories. (i) The grammar represents both the hierarchical decompositions from scenes to objects, parts, primitives, and pixels by terminal and non-terminal nodes, and the contexts for spatial and functional relations by horizontal links between the nodes. It formulates each object category as the set of all possible valid configurations produced by the grammar. (ii) The grammar is embodied in a simple And–Or graph representation where each Or-node points to alternative sub-configurations and each And-node is decomposed into a number of components. This representation supports recursive top-down/bottom-up procedures for image parsing under the Bayesian framework and makes it convenient to scale up in complexity. Given an input image, the image parsing task constructs a most probable parse graph on the fly as the output interpretation, and this parse graph is a subgraph of the And–Or graph obtained by making choices at the Or-nodes. (iii) A probabilistic model is defined on this And–Or graph representation to account for the natural occurrence frequency of objects and parts as well as their relations. This model is learned from a relatively small training set per category and then sampled to synthesize a large number of configurations to cover novel object instances in the test set. This generalization capability is mostly missing in discriminative machine learning methods and can largely improve recognition performance in experiments. (iv) To fill the well-known semantic gap between symbols and raw signals, the grammar includes a series of visual dictionaries and organizes them through graph composition.
At the bottom level, the dictionary is a set of image primitives, each having a number of anchor points with open bonds to link with other primitives. These primitives can be combined to form larger and larger graph structures for parts and objects. The ambiguities in inferring local primitives shall be resolved through top-down computation using larger structures. Finally, these primitives form a primal sketch representation that generates the input image with every pixel explained. The proposed grammar integrates three prominent representations in the literature: stochastic grammars for composition, Markov (or graphical) models for contexts, and sparse coding with primitives (wavelets). It also combines the structure-based and appearance-based methods in the vision literature. Finally, the paper presents three case studies to illustrate the proposed grammar.
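
To make points (i) to (iii) of the abstract concrete, the following is a minimal Python sketch of an And–Or graph, assuming a toy "clock" category invented here for illustration. The class `Node`, the functions `sample_parse`, `enumerate_configs`, and `build_toy_clock`, and all branch probabilities are hypothetical and do not come from the paper. The sketch covers only the vertical decomposition: horizontal relation links, the learned probabilistic model, and the pixel-level dictionaries are omitted.

```python
import itertools
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                                   # "and", "or", or "terminal"
    children: list = field(default_factory=list)
    probs: list = field(default_factory=list)   # branch probabilities (Or-nodes only)


def sample_parse(node, rng):
    """Sample one parse tree, returned as nested (name, [subtrees]) tuples.

    An Or-node picks exactly one alternative according to its branch
    probabilities; an And-node expands every one of its components.
    """
    if node.kind == "terminal":
        return (node.name, [])
    if node.kind == "or":
        child = rng.choices(node.children, weights=node.probs, k=1)[0]
        return (node.name, [sample_parse(child, rng)])
    return (node.name, [sample_parse(c, rng) for c in node.children])


def leaves(tree):
    """Collect the terminal names of a parse tree, left to right."""
    name, subtrees = tree
    if not subtrees:
        return [name]
    return [leaf for sub in subtrees for leaf in leaves(sub)]


def enumerate_configs(node):
    """List every terminal configuration the (acyclic) graph can produce."""
    if node.kind == "terminal":
        return [(node.name,)]
    child_sets = [enumerate_configs(c) for c in node.children]
    if node.kind == "or":
        # Alternatives: the union of the children's configurations.
        return [cfg for configs in child_sets for cfg in configs]
    # Components: one configuration from each child, concatenated.
    return [sum(combo, ()) for combo in itertools.product(*child_sets)]


def build_toy_clock():
    """Hypothetical example: a clock is a frame AND a set of hands."""
    frame = Node("frame", "or",
                 [Node("round_frame", "terminal"), Node("square_frame", "terminal")],
                 [0.7, 0.3])  # assumed probabilities, for illustration only
    hands = Node("hands", "or",
                 [Node("two_hands", "terminal"), Node("three_hands", "terminal")],
                 [0.5, 0.5])  # assumed probabilities, for illustration only
    return Node("clock", "and", [frame, hands])


if __name__ == "__main__":
    graph = build_toy_clock()
    rng = random.Random(0)
    print(leaves(sample_parse(graph, rng)))   # one sampled configuration
    print(enumerate_configs(graph))           # all four valid configurations
```

Each call to `sample_parse` fixes a choice at every Or-node and so yields one parse tree, while `enumerate_configs` lists the whole (here finite) set of configurations that defines the toy category. A fuller model would attach relations between sibling nodes and score parse graphs against an input image.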

DOI:10.1561/0600000018
ISBN: 978-1-60198-060-1 (paperback), 116 pp., $80.00
ISBN: 978-1-60198-061-8 (e-book, PDF), 116 pp., $100.00

A Stochastic Grammar of Images is the first book to provide a foundational review and perspective of grammatical approaches to computer vision. In its quest for a stochastic and context-sensitive grammar of images, it is intended to serve as a unified framework of representation, learning, and recognition for a large number of object categories. It starts by addressing the historic trends in the area and giving an overview of the main concepts, such as the And–Or graph, the parse graph, and the dictionary, and goes on to learning issues, semantic gaps between symbols and pixels, datasets for learning, and algorithms. The proposed grammar integrates three prominent representations in the literature: stochastic grammars for composition, Markov (or graphical) models for contexts, and sparse coding with primitives (wavelets). It also combines the structure-based and appearance-based methods in the vision literature. At the end of the review, three case studies are presented to illustrate the proposed grammar.

A Stochastic Grammar of Images is an important contribution to the literature on structured statistical models in computer vision.

 
CGV-018