7/30/2019 03 Martinelli
MSc. Thesis: Scene layout segmentation of traffic environments using a Conditional Random Field
Fernando Cervigni Martinelli
Honda Research Institute Europe GmbH
A Thesis Submitted for the Degree of MSc Erasmus Mundus in Vision and Robotics (VIBOT)
2010
Abstract
At least 80% of the traffic accidents in the world are caused by human mistakes. Whether drivers are too tired, drunk or speeding, most accidents have their root in the improper behavior of drivers. Many of these accidents could be avoided if cars were equipped with some kind of intelligent system able to detect inappropriate actions of the driver and autonomously intervene by controlling the car in emergency situations. Such an advanced driver assistance system needs to be able to understand the car environment and, from that information, predict the appropriate behavior of the driver at every instant. In this thesis project we investigate the problem of scene understanding solely based on images from an off-the-shelf camera mounted to the car.

A system has been implemented that is capable of performing semantic segmentation and classification of road scene video sequences. The object classes which are to be segmented can be easily defined as input parameters. Some important classes for the prediction of the driver behavior include road, sidewalk, car and building, for example. Our system is trained in a supervised manner and takes into account information such as color, location, texture and also spatial context between classes. These cues are integrated within a Conditional Random Field model, which offers several practical advantages in the domain of image segmentation and classification. The recently proposed CamVid database, which contains challenging inner-city road video sequences with very precise ground truth segmentation data, has been used for evaluating the quality of our segmentation, including a comparison to state-of-the-art methods.
Everything should be made as simple as possible, but not simpler . . .
Albert Einstein
Contents
Acknowledgments v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Problem definition 4
2.1 Combined segmentation and recognition . . . . . . . . . . . . . . . . . . . . . . . 4
3 State of the art 6
3.1 Features for image segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.1 Spatial prior knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.2 Sparse 3D cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1.3 Gradient-based edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.4 Color distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.5 Texture cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.6 Context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Probabilistic segmentation framework . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2 Energy minimization for label inference . . . . . . . . . . . . . . . . . . . 13
3.3 Example: TextonBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Potentials without context . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Texture-layout potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Application to road scenes (Sturgess et al.) . . . . . . . . . . . . . . . . . . . . . 20
4 Methodology 21
4.1 CRF framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Basic model: location and edge potentials . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Texture potential model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 Feature vector and choice of filter bank . . . . . . . . . . . . . . . . . . . 24
4.3.2 Boosting of feature vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.3 Adaptive training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Texture-layout potential model (context) . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.1 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.2 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5 Results 37
5.1 Model without context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Model with context features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Influence of number of weak classifiers . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Influence of the different model potentials . . . . . . . . . . . . . . . . . . 39
5.2.3 Influence of 3D features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 CamVid sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4 Comparison to state of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6 Conclusions 49
Bibliography 54
List of Figures
2.1 Example of ideal segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.1 3D features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Gradient-based edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 GrabCut: segmentation using color GMMs and user interaction . . . . . . . . . . 9
3.4 The Leung-Malik (LM) filter bank . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Clique layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.6 Sample results of TextonBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.7 Image textonization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8 Texture-layout filters (context) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 Sturgess: higher-order cliques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1 Examples of location potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Intuitive example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Filter bank responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 The MR8 filter bank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 3D features interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.6 Adaboost training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.7 Adaboost classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Software architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Confusion matrix without adaptive training . . . . . . . . . . . . . . . . . . . . . 38
5.2 Confusion matrix after adaptive training . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Example of texture-layout features (context) . . . . . . . . . . . . . . . . . . . . 39
5.4 Influence of number of weak classifiers . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Influence of the different potentials . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6 Confusion matrices for 4-class segmentation . . . . . . . . . . . . . . . . . . . . . 44
5.7 Example of segmentations for 4-class set . . . . . . . . . . . . . . . . . . . . . . . 45
5.8 Results for 11-class set segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.9 Example of segmentations for 11-class set . . . . . . . . . . . . . . . . . . . . . . 47
5.10 Comparison to state of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1 Adaptive scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Acknowledgments
I would like to thank above all my family for the constant support. They are always with me,
even though they live on the other side of the Atlantic ocean.
My heartfelt thanks to my supervisors at Honda, Jannik Fritsch, who has been so nice and
given me all the support I needed, and Martin Heracles, who has carefully revised this thesis
report and given precious advice all along these four months. For his help with the iCub
repository and for providing me with his essential CRF code, I would like to sincerely thank
Andrew Dankers.
I wish also to thank my supervisor, Prof. Fabrice Meriaudeau, and all professors of the
Vibot Masters. It is hard to fathom how much I learned with you during these 2 years. Thanks
also for offering this program, which has been an amazing and unforgettable experience.
Last but not least, I wish to thank all my Vibot mates, who have been a great company
studying before exams or chilling at the bar.
Chapter 1
Introduction
1.1 Motivation
Within the Honda Research Institute Europe (HRI-EU), the Attentive Co-Pilot project (ACP)
conducts research on a multi-function Advanced Driver Assistance System (ADAS). It is desired
and to be expected that, in the future, cars will autonomously respond to inappropriate actions
taken by the driver. If he or she does not stop the car when the traffic lights are red, or falls
asleep and slowly deviates from the normal driving course, the car should trigger an emergency
procedure and warn the driver. A similar warning should come up, for example, if the driver
gets distracted and the car in front suddenly brakes, without the driver noticing it. It would
be even safer if the car had the capability of not only recognizing this and warning the driver, but
also of taking over control in critical situations and safely correcting the driver's inappropriate
actions. Since human mistakes, and not technical problems, are by far the main cause of traffic
accidents, countless lives could be saved and much damage avoided if such reliable advanced
driver assistance systems existed and were widely implemented.
If, however, this Advanced Driver Assistance System is to become responsible for saving
lives, in a critical real-time context, it cannot afford to fail. In order to manage the extremely challenging task of building such an intelligent system, many smaller problems have to be
successfully tackled. One of the most important is related to understanding and adequately
representing the environment in which the car operates. For that, a variety of sensors and input
data can be used. Indeed, participants of the DARPA Urban Challenge [5], which requires
autonomous vehicles to drive through specific routes in a restricted city environment, rely on
a wide range of sensors such as GPS, Radar, Lidar, inertial guidance systems as well as on the
use of annotated maps.
One of our aspirations, though, is to achieve the task of scene understanding by visual
perception alone, using an off-the-shelf camera mounted in the car. We humans prove in our
daily life as drivers that seeing the world is largely sufficient to achieve an understanding of the
traffic environment. By ruling out the use of complicated equipment and sensing techniques, we
aim at, once a reliable driver assistance system is achieved, manufacturing it cheaply enough for
it to be highly scalable. Considering their great potential for increasing the safety of drivers,
and therefore also of pedestrians, bicyclists, and other traffic participants, such advanced
driver assistance systems will most likely become an indispensable car component, like today's
seat-belts.
1.2 Goal
A first step to understanding and representing the world surrounding the car is to segment
the images acquired by the camera in meaningful regions and objects. In our case, meaningful
regions are understood as the regions that are potentially relevant for the behavior of the
driver. Examples of such regions are the road, sidewalks, other cars, traffic signs, pedestrians,
bicyclists and so on. In contrast, in our context it is not so important, for example, to segment
and distinguish a building on the side of the road as an individual class, since, as far as the
driver behavior is concerned, it makes no difference whether there is a building, a fence or even
a tree at that location.
In order to correctly segment such meaningful regions, we need to consider semantic aspects
of the scene rather than only its appearance, that is, even if the road consists of dark and bright
regions because of shadows, it should still be segmented as only one semantic region. This can
be achieved by supervised training using ground truth segmentation data.
The work described in this thesis aims at performing this task of semantic segmentation,
exploring the most recent insights of researchers in the field, as well as well-known and state-
of-the-art image processing and segmentation techniques.
1.3 Thesis outline
This thesis is structured in five more chapters. In Chapter 2, the main goal of the investigation
done in this thesis project is formalized and explained. Chapter 3 investigates the state of the art
in the field of semantic segmentation and road scene interpretation. Cutting-edge algorithms
like TextonBoost are described in greater detail as they are fundamental to state-of-the-art
methods. In Chapter 4, the methodology and implementation steps followed throughout this
thesis project are detailed. Chapter 5 shows the results obtained for the CamVid database, both
for a set of four classes and for a set of eleven classes. A comparison of these results to the
state of the art mentioned in Chapter 3 is also shown. Finally, in Chapter 6 the conclusions
of the thesis are presented and suggestions regarding the areas on which future efforts should
focus are given.
Chapter 2
Problem definition
2.1 Combined segmentation and recognition
The main goal of this thesis project is to investigate, implement and evaluate a system that per-
forms segmentation of road scene images including a classification of the different object classes
involved. More specifically, each input color image x GMN3, where G = {0, 1, 2, , 255}
and M and N are the image height and width, respectively, must be pixel-wise segmented.
That means that each pixel i of the image has to be assigned one of N pre-defined classes or
labels, of a set L = {l1, l2, l3, , lN}. In mathematical terms, the segmentation investigated
can be defined as a function f that takes the color image x GMN3 and returns a label
image f(x) = y LMN, also called a labeling of x:
This is achieved by supervised training, which means that the system is given labeled training
images, from which it should learn in order to subsequently segment new, unseen images.
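As a concrete illustration of the notation above, a labeling can be represented as an integer image whose entries index into the class set. The following Python sketch shows only the input/output contract of such a function f; the class names, shapes and the trivial top/bottom rule are illustrative placeholders, not the actual system:

```python
import numpy as np

# Hypothetical 4-class label set; the real system takes the class set
# as an input parameter.
LABELS = ["road", "sidewalk", "car", "sky"]

def segment(x):
    """Toy stand-in for f : G^(MxNx3) -> L^(MxN). Given a color image x
    of shape (M, N, 3) with values in {0, ..., 255}, return a label
    image y of shape (M, N) whose entries index into LABELS.
    This placeholder labels the top half 'sky' and the bottom half
    'road'; a trained system would infer y from learned potentials."""
    M, N, _ = x.shape
    y = np.full((M, N), LABELS.index("road"), dtype=np.int64)
    y[: M // 2, :] = LABELS.index("sky")
    return y

x = np.random.randint(0, 256, size=(4, 6, 3))
y = segment(x)
assert y.shape == (4, 6)                  # exactly one label per pixel
assert y[0, 0] == LABELS.index("sky") and y[3, 0] == LABELS.index("road")
```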
According to the state of the art, supervised segmentation techniques yield better
results than unsupervised techniques (see Chapter 3). This is not surprising, since unsupervised
segmentation techniques do not have ground truth information from which to learn semantic
properties and hence can only segment the images based on purely data-driven features.

Figure 2.1 shows a typical inner-city road scene as considered in this thesis project, as well
as an ideal segmentation, obtained by manual annotation. The example is taken from the CamVid
database [4], which is a recently proposed image database with high-quality, manually-labeled
ground truth which we use for training our system. The images have been acquired by a car-
mounted camera, filming the scene in front of the car while driving in a city. More detail about
the CamVid dataset is given in Chapter 5.
Theoretically, it would be ideal if the segmentation algorithm proposed could precisely
Figure 2.1: (a) An example of a typical inner-city road scene extracted from the CamVid database. (b) The corresponding manually labeled ground truth, taking into account classes like road, pedestrian, sidewalk and sky, among others. The goal of the segmentation system to be implemented is to produce, given an image (a), an automatic segmentation that is as close as possible to the ground truth (b).
segment all 32 classes annotated in the CamVid database. However, the more classes one tries
to segment the more challenging and time-consuming the problem becomes. Although our
system is supposed to be able to segment an arbitrary set of classes, as long as they are present
in the training database, a compromise between computational efficiency and the number of
classes to segment has to be reached. More importantly, many of the classes defined in the
CamVid database have little, if any, influence on the behaviour of the driver. Bearing
this in mind, the segmentation algorithm should be optimized and tailored towards the most
behaviorally relevant classes.
Furthermore, a related study recently conducted in the ACP Group suggests that, in order
to achieve a good prediction of the driver's behavior, more effort should be invested in how to
use such a segmentation of meaningful classes in terms of segmentation-based features rather
than in precisely segmenting a vast number of classes that may not influence, after all, how the
driver controls the car [11].
Chapter 3
State of the art
The problem of image segmentation has been the focus of countless image processing
researchers around the globe for some decades. Although the problem itself is old, the solution
to many segmentation tasks remains under active investigation still today, in particular for
image segmentation applied to highly complex real-world scenes (e.g. traffic scenes). This
chapter describes some of the techniques for image segmentation that have been applied in
areas related to the one investigated in this thesis project.
3.1 Features for image segmentation
3.1.1 Spatial prior knowledge
One of the simplest yet most useful cues that may be exploited when segmenting images in a
supervised fashion is the location of objects in the scene. For many object classes, there
is an important correlation between the label to which a region of an image belongs and its
location in the image. For instance, the fact that the road is mostly at the lower part of pictures
could be helpful for its segmentation. The same applies for the segmentation of the sky, which
is normally at the upper part of an image. Many similar examples can be mentioned, like
buildings usually being on the sides of the image, which makes this feature powerful despite
its simplicity.
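The location cue can be turned into a simple statistical prior by counting, over the ground-truth training labelings, how often each class occurs at each pixel position. A minimal sketch, assuming aligned ground-truth label images of equal size (function name and toy data are illustrative):

```python
import numpy as np

def location_prior(gt_labelings, n_classes):
    """Estimate P(label | pixel position) by counting, over a set of
    ground-truth labelings of identical size, how often each class
    occurs at each pixel. Returns an (M, N, n_classes) array that
    sums to 1 over the class axis."""
    gt = list(gt_labelings)
    M, N = gt[0].shape
    counts = np.zeros((M, N, n_classes))
    for y in gt:
        for c in range(n_classes):
            counts[:, :, c] += (y == c)
    return counts / counts.sum(axis=2, keepdims=True)

# Toy example: class 1 ("sky") always on the top row, class 0 ("road")
# always on the bottom row.
gt = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [0, 0]])]
prior = location_prior(gt, n_classes=2)
assert prior[0, 0, 1] == 1.0   # top-left pixel: always sky
assert prior[1, 0, 0] == 1.0   # bottom-left pixel: always road
```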
3.1.2 Sparse 3D cues
Different regions in an image often lie at different depths. Therefore, if available, the information
of how far each point in the image was from the camera when the image was acquired can be very
Figure 3.1: The algorithm proposed by Brostow et al. uses 3D point clouds estimated from video sequences and performs, using motion and structure features, a very satisfactory 11-class semantic segmentation.
useful for segmentation purposes. Since individual images do not carry any depth information,
3D cues can only be explored in specific cases where one can either measure or infer how far
the objects in an image are. If the use of radars or equipment that directly measure distance
is to be discarded, 3D information can be inferred by using a stereo camera set or, in the case
of a single camera, by using structure-from-motion techniques [9]. When dealing with images
taken from an ordinary video, structure-from-motion techniques must be applied.
Figure 3.1, extracted from the work of Brostow et al. [3], shows how accurate the segmen-
tation of road scenes can get only by using reconstructed 3D point clouds.
Brostow et al. based their work on the following features, which can be extracted from the
sparse 3D point cloud:
- Height above the ground;
- Shortest distance from the car's path;
- Surface orientation of groups of points;
- Residual error, which is a measure of how objects in the scene move with respect to the world;
- Density of points.
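Two of these features can be sketched directly from a raw point cloud. The snippet below is an illustrative approximation (a flat ground plane at a fixed height and brute-force neighbour counting), not Brostow et al.'s actual pipeline:

```python
import numpy as np

def sparse_3d_features(points, ground_z=0.0, radius=1.0):
    """Compute two of the cues listed above from an (n, 3) point cloud:
    height above an assumed ground plane z = ground_z, and local point
    density (number of neighbours within `radius`). Brute-force O(n^2)
    pairwise distances; a real system would use a spatial index."""
    height = points[:, 2] - ground_z
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    density = (d < radius).sum(axis=1) - 1   # exclude the point itself
    return height, density

# Two nearby low points (e.g. road surface) and one high isolated point.
pts = np.array([[0.0, 0.0, 0.1], [0.2, 0.0, 0.1], [5.0, 5.0, 3.0]])
h, dens = sparse_3d_features(pts)
assert np.isclose(h[2], 3.0)           # high above the ground plane
assert dens[0] == 1 and dens[2] == 0   # dense vs. isolated
```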
3.1.3 Gradient-based edges
When one thinks of image segmentation, it is natural to expect that the label boundaries
correspond to strong edges on the image being segmented. For example, the image of a blue
car on a city street will have rather strong edges where, in a perfect labeling, the boundaries
between the labels car and street are located. Some methods, like, for example, active contour
snakes [13], explore gradient-based edge information for segmentation. Figure 3.2 shows an
Figure 3.2: (a) Original grayscale image of Lena. (b) Edge image obtained by calculating the image gradients. Edge-based segmentation methods explore the information in (b) to propose a meaningful segmentation of (a). Note how Lena's hat, face and shoulder could be quite well segmented with this edge cue alone.
example picture of Lena and its gradient. The white pixels have a greater probability of being
located on boundaries between labels in a segmentation.
Notice that although this is a very reasonable and useful cue, it can also turn out to be
misleading. When dealing, for example, with shadowed scenes, very often there are stronger
edges inside regions that belong to the same label than there are on the boundaries between labels. This is particularly challenging for real-world scenes such as the traffic scenes considered
in this thesis project. The way this cue was explored in this project is explained in detail in
Section 3.3.1.
3.1.4 Color distribution
Early methods, like [20], tackle the problem of image segmentation by relying solely on color
features, which can be modeled as histogram distributions or by Gaussian Mixture Models
(GMMs). A Gaussian Mixture Model represents a probability distribution, P(x), which is
obtained by summing different Gaussian distributions:
P(x) = Σ_k P_k(x)    (3.1)

where

P_k(x) = N(x | μ_k, σ_k²)    (3.2)

with μ_k, σ_k² being the mean and variance of the individual Gaussian distribution k.
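For a one-dimensional feature, the mixture density of Eqs. 3.1–3.2 can be evaluated directly. Note that, for P(x) to integrate to one, each component usually carries a mixture weight; the sketch below includes such weights w_k as an assumption on top of the equations as printed:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, weights, mus, vars_):
    """Mixture density P(x) = sum_k w_k N(x | mu_k, var_k); the
    weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, mus, vars_))

# Two-component mixture: the density is much larger near a component
# mean (x = 0) than in the valley between the components (x = 2).
args = ([0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
assert gmm_pdf(0.0, *args) > gmm_pdf(2.0, *args)
```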
Figure 3.3: Segmentation achieved by GrabCut using color GMMs and user interaction.

The use of GMMs to model colors in images has also proven very efficient in binary segmentation problems, as shown by Rother et al. [19] with their GrabCut algorithm. In such problems, one wants to separate a foreground object from the background for image editing,
object recognition and so on. When possible, user interaction can be very useful to refine the
results by giving important feedback after the initial automatic segmentation (see Figure 3.3).
However, in most cases, either the number of images to segment is prohibitive or the real
time nature of the segmentation task prevents any user interference at all. Both of these remarks
are true in the field of traffic scene segmentation for driver assistance.
3.1.5 Texture cues
Along with color, texture information is often considered and can bring significant improvement
to the segmentation accuracy, as in [7], where gray-level texture features were combined with
color ones. Nowadays, most if not all of the research effort on segmentation also incorporates
texture information. This can be extracted and modeled in two main ways:
1. Statistical models, which try to describe the statistical correlation between pixel colors
within a restricted vicinity. Among such methods, co-occurrence matrices have been
successfully used, for instance, for seabed classification [18];
Figure 3.4: The LM filter bank has a mix of edge, bar and spot filters at multiple scales and orientations. It has a total of 48 filters: 2 Gaussian derivative filters at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters and 4 Gaussian filters.
2. Filter bank convolution, where the image is convolved with a carefully selected set of filter
primitives, usually composed of Gaussians, Gaussian derivatives and Laplacians. A well
known example, the Leung-Malik (LM) filter bank [16], is shown in Figure 3.4. It is
interesting to mention that such filter banks have similarities with the receptive fields of
neurons in the human visual cortex.
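The idea of filter-bank texture features can be sketched with a handful of separable kernels. The bank below (Gaussian spot, derivative-of-Gaussian edge, and difference-of-Gaussians blob filters, the latter a common approximation of the Laplacian of Gaussian) is a toy stand-in for LM-style banks, not the bank used later in this thesis:

```python
import numpy as np

def gauss1d(sigma, radius):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def conv2_same(img, kx, ky):
    """Separable same-size convolution with zero padding:
    kx along rows, then ky along columns."""
    r = len(kx) // 2
    p = np.pad(img, r)
    h = np.array([np.convolve(row, kx, mode="valid") for row in p])
    return np.array([np.convolve(c, ky, mode="valid") for c in h.T]).T

def bank_responses(img, sigmas=(1.0, 2.0)):
    """Per-pixel texture features: spot (Gaussian), edge (derivative of
    Gaussian) and blob (difference of Gaussians) responses at each
    scale, stacked into one feature vector per pixel."""
    feats = []
    for s in sigmas:
        r = int(3 * s) + 1
        g, g2 = gauss1d(s, r), gauss1d(1.6 * s, r)
        dg = np.gradient(g)                        # smooth derivative
        feats.append(conv2_same(img, g, g))        # spot
        feats.append(conv2_same(img, dg, g))       # edges along x
        feats.append(conv2_same(img, g, dg))       # edges along y
        feats.append(conv2_same(img, g2, g2) - conv2_same(img, g, g))
    return np.stack(feats, axis=-1)

f = bank_responses(np.ones((12, 12)))
assert f.shape == (12, 12, 8)
assert np.isclose(f[6, 6, 0], 1.0)     # spot filter preserves a flat image
assert abs(f[6, 6, 1]) < 1e-8          # no edge response on a flat image
```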
3.1.6 Context features
Although color and texture may efficiently characterize image regions, they are far from enough
for a high quality semantic segmentation if considered alone. For instance, even humans may
not be able to tell apart, when looking only at a local patch of an image, a blue sky from the
walls of a blue building. The key aspect of which humans naturally take advantage, and that
allows them to unequivocally understand scenes, is the context. Even if one sees a building
wall painted with exactly the same color as the sky, one just knows that that wall cannot be
the sky because it is surrounded by windows. In the case of road scenes segmentation, typical
spatial relationships between objects can be a very strong cue: for example, the fact that the
car is always on the road, which, in turn, is usually surrounded by sidewalks.
With this in mind, computer vision researchers are now frequently looking beyond low-level
features and are more interested in contextual cues [7, 10, 14]. In Section 3.3, an example of
how context in images can be exploited for segmentation is described.
3.2 Probabilistic segmentation framework
The choice of image features, described in the previous section, is independent of the theoretical
framework or machine learning technique applied for segmentation inference. One can choose
the very same features as in [7], where belief networks are used, and process them using Support
Vector Machines, for example. In recent years, Conditional Random Fields (CRFs) have played
an increasingly central role. CRFs have been introduced by Lafferty et al. in [15] and have ever
since been systematically used in cutting-edge segmentation and classification approaches like
TextonBoost [21], image sequence segmentation [27], contextual analysis of textured scenes [24]
and traffic scene understanding [22], to name a few. Conditional Random Fields are based on
Markov Random Fields and offer practical advantages for image classification and segmentation. These advantages are explained in the next section, after the formal definition of Markov
Random Fields is given.
3.2.1 Conditional Random Fields

In Random Field theory, an image can be described by a lattice S composed of sites i,
which can be thought of as the image pixels. The sites in S are related to one another via
a neighborhood system, defined as N = {N_i, i ∈ S}, where N_i is the set of sites
neighbouring i. Additionally, i ∉ N_i and i ∈ N_j ⇔ j ∈ N_i.
Let y denote a labeling configuration of the lattice S belonging to the set of all possible
labelings Y. In the image segmentation context, y can be seen as a labeling image, where each
of the sites (or pixels) i from the lattice S is assigned one label y_i from the set of possible labels
L = {l1, l2, l3, …, lN}, which are the object classes. The pair (S, N) can be referred to as a
Random Field.
Moreover, (S,N) is said to be a Markov Random Field (MRF) if and only if
P(y) > 0, ∀ y ∈ Y, and    (3.3)

P(y_i | y_{S \ {i}}) = P(y_i | y_{N_i})    (3.4)
That means, firstly, that the probability of any defined label configuration must be greater
than zero¹ and, secondly and most importantly, that the probability of a site assuming a given
label depends only on its neighboring sites. The latter statement is also known as the Markov
condition.
According to the Hammersley-Clifford theorem [1], an MRF as defined above can equivalently be characterized by a Gibbs distribution. Thus, the probability of a labeling y can be
written as
P(y) = Z⁻¹ exp(−U(y)),    (3.5)
where
Z = Σ_{y ∈ Y} exp(−U(y))    (3.6)
¹ This assumption is usually taken for convenience, as, in practical terms, it does not influence the problem.
is a normalizing constant called the partition function, and U(y) is an energy function of the
form
U(y) = Σ_{c ∈ C} V_c(y).    (3.7)
C is the set of all possible cliques and each clique c has a clique potential Vc(y) associated
with it. A clique c is defined as a subset of sites in S in which every pair of distinct sites
are neighbours, with single-site cliques as a special case (see Figure 3.5). Due to the Markov
condition, the value of Vc(y) depends only on the local configuration of clique c.
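For a 4-connected pixel grid with unary (single-site) and pairwise cliques, the energy of Eq. 3.7 can be computed directly. The sketch below assumes a Potts pairwise potential, a common choice used here purely for illustration:

```python
import numpy as np

def gibbs_energy(y, unary, beta=1.0):
    """Energy U(y) of Eq. 3.7 for a 4-connected grid, as the sum of
    single-site clique potentials (unary) and pairwise clique
    potentials of Potts form, which charge `beta` whenever two
    neighbouring sites disagree.

    y:     (M, N) integer labeling.
    unary: (M, N, L) cost of assigning each of L labels at each site."""
    M, N = y.shape
    u = unary[np.arange(M)[:, None], np.arange(N)[None, :], y].sum()
    pair = beta * ((y[:, 1:] != y[:, :-1]).sum()
                   + (y[1:, :] != y[:-1, :]).sum())
    return u + pair

# Under P(y) = Z^-1 exp(-U(y)) (Eq. 3.5), lower energy = more probable:
# a smooth labeling costs less than one with disagreeing neighbours.
zero_unary = np.zeros((2, 2, 2))
assert gibbs_energy(np.zeros((2, 2), dtype=int), zero_unary) == 0.0
assert gibbs_energy(np.array([[0, 1], [0, 0]]), zero_unary) == 2.0
```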
(a) (b) (c)
Figure 3.5: (a) Example of a 4-pixel neighborhood. (b) Possible unary clique layout. (c) Possible binary clique layouts.
Now let us consider the observation x_i, for each site i, which is a state belonging to a set of
possible states W = {w1, w2, …, wn}. In this manner, we can represent the image we want to
segment, where each pixel i is assigned one state of the set W. If one thinks of a gray-scale
image with 8-bit resolution, for example, the set of possible states for each site (or pixel) would
be defined as W = {0, 1, 2, …, 255}. The segmentation problem then boils down to finding the
labeling y such that P(y|x), the posterior probability of labeling y given the observation
x, is maximized. Bayes' theorem tells us that
P(y|x) = P(x|y)P(y)/P(x) (3.8)
where P(x) is a normalization factor, like Z in Eq. 3.5, and plays no role in the maximization.
Thanks to the Hammersley-Clifford theorem, one can greatly simplify this maximization problem by defining the clique potential functions V_c(x, y, θ) only locally. How to choose the forms
and parameters θ of the potential functions for a specific application is a major topic in MRF
modeling and will be further discussed in Chapter 4.
The main difference between MRFs and CRFs lies in the fact that MRFs are generative
models, whereas CRFs are discriminative. That is, CRFs directly model the posterior distri-
bution P(y|x) while MRFs learn the underlying distributions P(x|y) and P(y), arriving at the
posterior distribution by applying the Bayes theorem.
In other words, for MRFs, the learned state-label joint probability yields the posterior
P(y|x) = P(x|y)P(y)/P(x), where x represents the observation and y the corresponding labeling
configuration. For CRFs, however, it is not required to model the likelihood P(x|y) and the
label prior P(y) as for MRFs, since the a posteriori P(y|x) is modeled directly.
This directly modeled posterior probability is simpler to implement and usually sufficient for
segmenting images. Hence, for the road scene segmentation and classification problem at hand,
CRFs are advantageous in comparison to MRFs. This is the main reason why they became so
popular [21,22,27].
3.2.2 Energy minimization for label inference
Finding the labeling y that maximizes the a posteriori probability expressed in Eq. 3.5 is
equivalent to finding the y that minimizes the energy function in Eq. 3.7. An efficient way of
finding a good approximation of the energy minimum of such functions is the alpha-expansion
graph-cut algorithm [2], which is widely used along with MRFs and CRFs. The idea of the
alpha-expansion algorithm is to reduce the problem of minimizing a function like U(y) with multiple
labels to a sequence of binary minimization problems. These sub-problems are referred to as
alpha-expansions and are shortly described here for completeness (for details see [2]).
Suppose that we have a current image labeling y and one randomly chosen label α ∈ L =
{l1, l2, l3, …, lN}. In the alpha-expansion operation, each pixel i makes a binary decision: it
can either keep its old label y_i or switch to label α, provided that this change decreases the
value of the energy function. For that, we introduce a binary vector s ∈ {0, 1}^(M×N) which
indicates which pixels in the image (of size M × N) keep their label and which switch to label
α. This defines the auxiliary configuration y[s] as

y_i[s] = y_i if s_i = 0;  α if s_i = 1    (3.9)
This auxiliary configuration y[s] transforms the function U with multiple labels into a func-
tion of binary variables U'(s) = U(y[s]). If the function U is composed of attractive potentials,
which can be seen as a kind of convex function, the global minimum of this binary function²
is guaranteed to be found exactly using standard graph cuts [21].
The expansion move algorithm starts with any initial configuration y⁰, which could be set,
for instance, by taking, for each pixel, the label with the maximum location prior probability³. It
then computes optimal alpha-expansion moves for labels α in a random order, accepting the
² Notice that this does not mean that the global minimum of the multi-label function is found.
³ In the road scene segmentation case, for instance, pixels at the top of the image could start with label sky and pixels at the bottom with label road. This is equivalent to exploiting the features described in Section 3.1.1.
moves only if they decrease the energy function. The algorithm is guaranteed to converge, and
its output is a strong local minimum, characterized by the property that no further alpha-
expansion can decrease the value of the function U.
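As a toy illustration of this expansion-move loop, the following Python sketch runs alpha-expansions on a one-dimensional chain of pixels with unary costs and a Potts pairwise term. All names are our own; the binary subproblem is solved here by brute force over s, which is feasible only for tiny chains, whereas a real implementation solves each subproblem exactly with a graph cut [2].

```python
import itertools
import numpy as np

def energy(y, unary, lam=1.0):
    """U(y) = sum_i unary[i, y_i] + lam * (number of neighboring label disagreements)."""
    e = sum(unary[i, yi] for i, yi in enumerate(y))
    e += lam * sum(y[i] != y[i + 1] for i in range(len(y) - 1))
    return e

def alpha_expansion(unary, labels, lam=1.0, sweeps=3):
    """Expansion-move outer loop on a 1-D chain of pixels.

    Each alpha-expansion reduces the multi-label problem to a binary one:
    pixel i keeps y_i (s_i = 0) or switches to alpha (s_i = 1).  The binary
    subproblem is solved by brute force over s (for illustration only);
    a real implementation would use a graph cut."""
    n = unary.shape[0]
    y = list(np.argmin(unary, axis=1))          # init: best unary label per pixel
    for _ in range(sweeps):
        for alpha in labels:
            best_s, best_e = None, energy(y, unary, lam)
            for s in itertools.product([0, 1], repeat=n):
                y_s = [alpha if si else yi for yi, si in zip(y, s)]
                e = energy(y_s, unary, lam)
                if e < best_e:
                    best_s, best_e = s, e
            if best_s is not None:              # accept only energy-decreasing moves
                y = [alpha if si else yi for yi, si in zip(y, best_s)]
    return y
```

Each accepted move strictly decreases U, which is why the algorithm terminates in a strong local minimum.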
3.3 Example: TextonBoost
One CRF-based approach to image segmentation that is fundamental for current state-of-
the-art methods is TextonBoost [21]. In their research, Shotton et al. used the Microsoft
Research Cambridge (MSRC) database⁴, which is composed of 591 photographs of the following
21 object classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat. Approximately half of those pictures
are picked for training, in a way that ensures proportional contributions from each class. Some
results of their semantic segmentation on previously unseen images are shown in Figure 3.6.
Figure 3.6: TextonBoost results extracted from [21]. Above, unseen test images. Below, segmentation using a color-coded labeling. Textual labels are superimposed for better visualization.
Since the algorithm implemented for the segmentation of road scenes in this master thesis
has been mainly inspired by TextonBoost, a short description of the way it works is provided.
The inference framework used is a conditional random field (CRF) model [15]. The CRF
learns, through the training of the parameters of the clique potentials, the conditional distribution
over the possible labels given an input image. The use of a conditional random field allows
the incorporation of texture, layout, color, location, and edge cues in a single, unified model.
The energy function U(y|x, θ), which is the sum of all the clique potentials (see Eq. 3.7), is defined as:

U(y|x, θ) = Σ_i [ λ(y_i, i; θ_λ) + π(y_i, x_i; θ_π) + ψ_i(y_i, x; θ_ψ) ] + Σ_{(i,j)∈ε} φ(y_i, y_j, g_ij(x); θ_φ)   (3.10)

where y is the labeling or segmentation, x is a given image, ε is the set of edges in a
4The MSRC database can be downloaded at http://research.microsoft.com/vision/cambridge/recognition/
4-connected neighborhood, θ = {θ_λ, θ_π, θ_ψ, θ_φ} are the model parameters, and i and j index
pixels in the image, which correspond to sites in the lattice of the Conditional Random Field.
Notice that the model consists of three unary potentials, which depend only on one site i in
the lattice, and one pairwise potential, which depends on pairs of neighboring sites.
Each of the potentials is subsequently explained in a simplified way; for details please see [21].
3.3.1 Potentials without context
Location potential
The unary location potentials λ(y_i, i; θ_λ) capture the correlation between the class label and the absolute location of the pixel in the image. For the databases with which TextonBoost was tested,
the location potentials had rather low importance, since the context of the pictures is very
diverse. In the case of our road scene segmentation, which deals with a more structured environment,
they have significantly more relevance, as discussed in Chapter 5.
Color potential
In TextonBoost, the color distributions of object classes are represented as Gaussian Mixture
Models (see Section 3.1.4) in the CIELab color space, where the mixture coefficients depend on the
class label. The conditional probability of the color x of a pixel labeled with class y is given by

P(x|y) = Σ_k P(x|k) P(k|y)   (3.11)

with color clusters (mixture components) P(x|k). Notice that the clusters are shared between
different classes, and that only the coefficients P(k|y) depend on the class label. This makes
the model more efficient to learn than a separate GMM for each class, which is important since
TextonBoost takes into account a high number of classes.
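The cluster sharing of Eq. 3.11 can be sketched in a few lines of Python. The cluster parameters and mixture coefficients below are hypothetical placeholders, not the values TextonBoost actually learns:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at x."""
    d = len(mean)
    diff = np.asarray(x, float) - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ inv @ diff) / norm)

def color_likelihood(x, clusters, mix):
    """P(x|y) = sum_k P(x|k) P(k|y) (Eq. 3.11): the color clusters P(x|k) are
    shared by all classes; only the mixture coefficients mix[y, k] (each row
    summing to one) depend on the class label y."""
    comp = np.array([gaussian_pdf(x, m, c) for m, c in clusters])
    return mix @ comp                       # vector with one likelihood per class
```

With two shared clusters and two classes whose coefficients favor different clusters, a color near the first cluster center is more likely under the class that weights that cluster highly.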
Edge potential
The pairwise edge potentials have the form of a contrast-sensitive Potts model [2],

φ(y_i, y_j, g_ij(x); θ_φ) = θ_φ^T g_ij(x) [y_i ≠ y_j],   (3.12)

with [·] the zero-one indicator function:

[condition] = { 1, if condition is true;  0, otherwise }   (3.13)
The edge feature g_ij measures the difference in color between the neighboring pixels, as suggested by [19],

g_ij = [ exp(−β ‖x_i − x_j‖²), 1 ]^T   (3.14)

where x_i and x_j are three-dimensional vectors representing the CIELab colors of pixels i and j,
respectively. Including the unit element allows a bias to be learned, which helps remove small, isolated
regions⁵. The quantity β is an image-dependent contrast term, and is set separately for each
image to β = (2 ⟨‖x_i − x_j‖²⟩)⁻¹, where ⟨·⟩ denotes an average over the image. The two scalar
constants that compose the parameter vector θ_φ are appropriately set by hand.
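For horizontally neighboring pixel pairs, the edge feature and the contrast term β of Eq. 3.14 can be computed as in this Python sketch (the function name is our own, and only horizontal neighbors are handled for brevity):

```python
import numpy as np

def edge_features(img_lab):
    """Contrast-sensitive edge feature for horizontally neighboring pixels.

    g_ij = [exp(-beta * ||x_i - x_j||^2), 1], with the image-dependent
    contrast term beta = 1 / (2 * <||x_i - x_j||^2>), <.> averaging
    over the image."""
    diff2 = np.sum((img_lab[:, 1:] - img_lab[:, :-1]) ** 2, axis=-1)
    beta = 1.0 / (2.0 * diff2.mean() + 1e-12)
    g = np.exp(-beta * diff2)
    ones = np.ones_like(g)       # unit element, lets a label-boundary bias be learned
    return np.stack([g, ones], axis=-1)
```

The exponential term approaches zero across strong color edges, so the Potts penalty for a label change is small exactly where object boundaries are likely.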
3.3.2 Texture-layout potential
The texture-layout potential is the most important contribution of TextonBoost. It is based on
a set of novel features introduced in [21] as texture-layout filters. These new features
are capable of capturing, at once, the correlation between texture, spatial layout, and textural
context in an image.
Here, we briefly describe how the texture-layout features are calculated and the boosting
approach used to automatically select the best features and thereby learn the texture-layout
potentials used in Eq. 3.10.
Image textonization
As a first step, the images are represented by textons [17] in order to arrive at a compact
representation of the vast range of possible appearances of objects or regions of interest⁶. The
process of textonization is depicted in Figure 3.7 and proceeds as follows. At first, each of the
training images is convolved with a 17-dimensional filter bank. The responses for all training
pixels are then whitened (so that they have zero mean and unit covariance) and clustered
using a standard Euclidean-distance K-means clustering algorithm for dimension reduction.
Finally, each pixel in each image is assigned to the nearest cluster center found with K-means,
producing the texton map T, where pixel i has value T_i ∈ {1, ..., K}.
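The textonization step can be sketched as follows, assuming per-pixel filter responses are already available. This illustration standardizes each dimension separately rather than performing full covariance whitening, and uses a plain Lloyd iteration for K-means; names are our own:

```python
import numpy as np

def textonize(responses, K, iters=10, seed=0):
    """Normalize per-pixel filter responses (zero mean, unit variance per
    dimension; full whitening would also decorrelate dimensions) and cluster
    them with K-means; return the texton map T with values in {0, ..., K-1}."""
    h, w, d = responses.shape
    X = responses.reshape(-1, d).astype(float)
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dist.argmin(1)                  # nearest cluster center per pixel
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(0)
    return assign.reshape(h, w)
```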
Texture-Layout Filters
The texture-layout filter is defined by a pair (r, t) of an image region r and a texton t, as
illustrated in Figure 3.8. Region r is referenced relative to the pixel i being classified, and
texton t belongs to the texton map T. For efficiency reasons, only rectangular regions are
⁵ The unit element means that for every pair of pixels that have different labels a constant potential is added to the whole. This makes contiguous labels preferable when the energy function is minimized.
6Textons have been proven effective in categorizing materials [25] as well as generic object classes [28].
Figure 3.7: The process of image textonization, as proposed by [21]. All training images are convolved with a filter bank. The filter responses are clustered using K-means. Finally, each pixel is assigned a texton index corresponding to the nearest cluster center to its filter response.
implemented by TextonBoost, although any arbitrary region shape could be considered. A set
R of candidate rectangles is chosen at random, such that every rectangle lies inside a fixed
bounding box.
The feature response at pixel i of the texture-layout filter (r, t) is the proportion of pixels under
the offset region r + i that have been assigned texton t in the textonization process,

v_[r,t](i) = (1 / area(r)) Σ_{j ∈ (r+i)} [T_j = t].   (3.15)
Any part of the region r + i that lies outside the image does not contribute to the feature
response.
An efficient and elegant way to calculate the filter responses anywhere over an image can
be achieved with the use of integral images [26]. For each texton t in the texton map T, a
separate integral image I^(t) is calculated. In this integral image, the value at pixel i = (u_i, v_i)
is defined as the number of pixels in the original image that have been assigned texton t in
the rectangular region with top left corner at (1, 1) and bottom right corner at (u_i, v_i):

I^(t)(i) = Σ_{j: (u_j ≤ u_i) ∧ (v_j ≤ v_i)} [T_j = t].   (3.16)

The advantage of integral images is that they can later be used to compute the texture-
layout filter responses in constant time: if I^(t) is the integral image for texton channel t defined
as above, then the feature response is computed as

v_[r,t](i) = ( I^(t)(r_br) − I^(t)(r_bl) − I^(t)(r_tr) + I^(t)(r_tl) ) / area(r)   (3.17)

where r_br, r_bl, r_tr and r_tl denote the bottom right, bottom left, top right and top left corners
Figure 3.8: Graphical explanation of texture-layout filters, extracted from [21]. (a, b) An image and its corresponding texton map (colors represent texton indices). (c) Texture-layout filters are defined relative to the point i being classified (yellow cross). In this first example feature, region r1 is combined with texton t1 (in blue). (d) A second feature, where region r2 is combined with texton t2 (in green). (e) The response v_[r1,t1](i) of the first feature is calculated at three positions in the texton map (magnified). In this example, v_[r1,t1](i1) ≈ 0, v_[r1,t1](i2) ≈ 1, and v_[r1,t1](i3) ≈ 1/2. (f) The second feature (r2, t2), where t2 corresponds to grass, can learn that points i (such as i4) belonging to sheep regions tend to produce large values of v_[r2,t2](i), and hence can exploit the contextual information that sheep pixels tend to be surrounded by grass pixels.
of rectangle r.
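The constant-time lookup of Eqs. 3.16 and 3.17 can be sketched in Python as below (zero-based indices instead of the (1, 1) origin above; function names and the rectangle encoding are our own). Out-of-image parts of r + i contribute nothing to the count, while the denominator remains the full area(r):

```python
import numpy as np

def texton_integral_images(T, K):
    """One integral image per texton channel: I[t, u, v] counts the pixels
    with texton t in the rectangle from (0, 0) to (u, v) inclusive."""
    onehot = np.stack([(T == t).astype(np.int64) for t in range(K)])
    return onehot.cumsum(axis=1).cumsum(axis=2)

def filter_response(I, t, i, rect):
    """v_[r,t](i): fraction of texton-t pixels inside rectangle r offset to
    pixel i, via four integral-image lookups (Eq. 3.17).
    rect = (top, left, bottom, right), inclusive, relative to i."""
    top, left, bot, right = rect
    r0, c0 = i[0] + top, i[1] + left
    r1, c1 = i[0] + bot, i[1] + right
    H, W = I.shape[1:]
    def S(r, c):                         # clipped integral-image lookup
        if r < 0 or c < 0:
            return 0
        return I[t, min(r, H - 1), min(c, W - 1)]
    area = (bot - top + 1) * (right - left + 1)
    return (S(r1, c1) - S(r1, c0 - 1) - S(r0 - 1, c1) + S(r0 - 1, c0 - 1)) / area
```

Building the K integral images costs one pass over the image, after which every feature response is four lookups regardless of the rectangle size.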
Texture-layout features are sufficiently general to allow for an automatic learning of layout
and context information. Figure 3.8 illustrates how texture-layout filters are able to model
textural context and layout.
Boosting of texture-layout filters
A boosting algorithm iteratively selects the most discriminative texture-layout filters (r, t) as
weak learners and combines them into a strong classifier, which is used to derive the texture-layout
potential in Eq. 3.10. The boosting scheme used in TextonBoost shares each weak learner
between a set of classes C, so that a single weak learner classifies several classes at once.
According to the authors, this allows classification with cost sub-linear in the number of
classes, and leads to improved generalization.
The strong classifier learned is the sum of the classification confidences h_i^m(c) of M weak
learners,

H(c, i) = Σ_{m=1}^{M} h_i^m(c)   (3.18)

The confidence value H(y_i, i) for pixel i is then simply multiplied by a negative constant (so
that a positive confidence turns into a negative energy, which is preferred in the energy
minimization) to give the texture-layout potentials ψ_i used in Eq. 3.10:

ψ_i(y_i, x; θ_ψ) = −θ_ψ · H(y_i, i)   (3.19)
Each weak learner is a decision stump based on the feature response v_[r,t](i), of the form

h_i(c) = { a [v_[r,t](i) > θ] + b, if c ∈ C;  k_c, otherwise }   (3.20)

with parameters (a, b, {k_c}, θ, C, r, t). The region r and texton index t together specify the texture-
layout filter feature, and v_[r,t](i) denotes the corresponding feature response at position i. For
the classes that share this feature (c ∈ C), the weak learner gives h_i(c) ∈ {a + b, b},
depending on whether v_[r,t](i) is, respectively, greater or lower than the threshold θ. For classes
not sharing the feature (c ∉ C), the constant k_c ensures that unequal numbers of training
examples of each class do not adversely affect the learning procedure.
In order to choose the weak classifiers, TextonBoost uses the standard boosting algorithm
introduced by Schapire et al. in [8], which is explained here for completeness. Suppose we are
choosing the m-th weak classifier. Each training example i (a pixel in a training image) is paired
with a target value z_i^c ∈ {−1, +1} (where +1 means that pixel i has ground truth class c and
−1 that it does not) and assigned a weight w_i^c specifying its classification accuracy for class c after the
m − 1 previous rounds of boosting. The m-th weak classifier is chosen by minimizing an error
function J_error weighted by the w_i^c:

J_error = Σ_c Σ_i w_i^c (z_i^c − h_i^m(c))²   (3.21)

The training examples are then re-weighted:

w_i^c := w_i^c e^{−z_i^c h_i^m(c)}   (3.22)

Minimizing the error function J_error requires, for each new weak classifier, an expensive
brute-force search over the possible sharing classes C, features (r, t), and thresholds θ. As
shown in [21], however, given these parameters, a closed-form solution does exist for a, b and k_c.
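A single round of this selection can be sketched for one class (dropping the class sharing of Eq. 3.20 for brevity; names are our own). For a fixed threshold θ, the weighted squared error of Eq. 3.21 is minimized by weighted means of the targets on each side of the threshold, which plays the role of the closed-form solution for a and b:

```python
import numpy as np

def fit_stump(v, z, w, thresholds):
    """One boosting round without class sharing: pick the threshold theta
    (and the least-squares a, b) minimizing the weighted squared error
    sum_i w_i (z_i - h_i)^2 for stumps h_i = a * [v_i > theta] + b."""
    best = None
    for theta in thresholds:
        above = v > theta
        # weighted-mean targets on each side: b below, a + b above
        b = np.average(z[~above], weights=w[~above]) if np.any(~above) else 0.0
        ab = np.average(z[above], weights=w[above]) if np.any(above) else 0.0
        h = np.where(above, ab, b)
        err = np.sum(w * (z - h) ** 2)
        if best is None or err < best[0]:
            best = (err, theta, ab - b, b)
    return best   # (error, theta, a, b)
```

The full TextonBoost search additionally iterates over sharing sets C and texture-layout features (r, t), keeping the combination with the lowest J_error.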
3.4 Application to road scenes (Sturgess et al.)
In the more specific field of road scene segmentation, Sturgess et al. [22] have recently quite
successfully segmented inner-city road scenes into 11 different classes. Their method builds on
the work of Shotton et al. (see Section 3.3) and on that of Brostow et al. [3], integrating
the appearance-based features from TextonBoost with the structure-from-motion features from
Brostow et al. (see Section 3.1.2) in a higher-order CRF. According to the authors, the use
of higher-order cliques (that is, cliques with several pixels, instead of only pairs of pixels like
in TextonBoost) produces accurate segmentations with precise object boundaries. Figure 3.9
shows how Sturgess et al. use an unsupervised mean-shift segmentation of the input image to
obtain regions that are used as higher-order cliques and included in the energy function U to be
minimized.
Figure 3.9: The original image (left), its ground truth labeling (centre) and the mean-shift segmentation of the image (right). The segments in the mean-shift segmentation on the right are used to define higher-order potentials, allowing for more precise object boundaries in the final segmentation.
Sturgess et al. achieved an overall accuracy of 84%, compared to the previous state-of-
the-art accuracy of 69% [3], on the challenging CamVid database [4]. The work of Sturgess et al. is
therefore especially important for this thesis, as it successfully tackles the same inner-city scene
segmentation problem. The CamVid database is described in more detail in Chapter 5, where the
results obtained by our implementation are compared with those of Sturgess et al. [22].
Chapter 4
Methodology
4.1 CRF framework
After thorough consideration of related work, CRFs have been deemed very suitable and up-
to-date for dealing with the problem proposed in this thesis project. As discussed in Chapter
2, conditional random fields allow the incorporation of a wide variety of cues in a single, unified
model. Moreover, state-of-the-art work in the field of image segmentation (see Section 3.3,
TextonBoost) and also, more specifically, in the domain of inner-city road scene understanding
(see Section 3.4, Sturgess et al.) has used CRFs. Sturgess et al. have been able to very
successfully segment eleven different classes in road scenes, some of which are very important
to our final goal of driver behavior prediction.
4.2 Basic model: location and edge potentials
Location and edge cues, as mentioned in Section 3.1, are very meaningful and can significantly con-
tribute to the quality of any segmentation. In our case, location cues are all the more important
because we deal with a very spatially structured scene. The road will, for example, never be at
the top of the image, and the sky will never be at the bottom. We can thus extract valuable
information as to where to expect our classes to be located in the picture.
If, for a better understanding of the problem, we consider, at first, a model with just the
location and edge potentials, then the energy function to be minimized in order to infer the
most likely labeling becomes

U(y|x, θ) = Σ_i λ(y_i, i; θ_λ) + Σ_{(i,j)∈ε} φ(y_i, y_j, g_ij(x); θ_φ).   (4.1)

The location potential is calculated based on the incidence, over all the training images, of each
class at each pixel:

λ(y_i, i; θ_λ) = −log( (N_{y_i,i} + α) / (N_i + α) )   (4.2)

where N_{y_i,i} is the number of pixels at position i assigned class y_i in the training images, N_i is
the total number of pixels at position i, and α is a small integer that avoids the indefinition log(0)
when N_{y_i,i} = 0 (we use α = 1). Figure 4.1 illustrates the location potential of classes road
and sidewalk in images from the CamVid database.
Figure 4.1: (a) Location potential of class road, λ(road, i; θ_λ). (b) Location potential of class sidewalk, λ(sidewalk, i; θ_λ). The whiter a pixel, the higher the incidence of the corresponding class at that position in the training images.
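The count-based potential of Eq. 4.2 can be computed directly from the training label maps, as in this Python sketch (integer-coded label maps and the function name are our own assumptions; α is the small constant, set to 1):

```python
import numpy as np

def location_potentials(label_maps, n_classes, alpha=1):
    """Location potential of Eq. 4.2, lambda(y, i) = -log((N_{y,i} + alpha) /
    (N_i + alpha)), with the counts accumulated over the training label maps."""
    maps = np.stack(label_maps)                        # (num_images, H, W)
    counts = np.stack([(maps == c).sum(axis=0) for c in range(n_classes)])
    n_i = maps.shape[0]                                # N_i: labels observed at each i
    return -np.log((counts + alpha) / (n_i + alpha))   # (n_classes, H, W)
```

Frequent classes at a position get a potential near zero (low energy), while classes never observed there get a large but finite penalty thanks to α.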
The pairwise edge potential has the form of a contrast-sensitive Potts model [2], as defined
in TextonBoost:

φ(y_i, y_j, g_ij(x); θ_φ) = θ_φ^T g_ij(x) [y_i ≠ y_j],   (4.3)

with [·] the zero-one indicator function. The edge feature g_ij measures the difference in color
between the neighboring pixels, as suggested by [19],

g_ij = [ exp(−β ‖x_i − x_j‖²), 1 ]^T   (4.4)
With the help of an intuitive example, shown in Figure 4.2a, we can see how location and
-
7/30/2019 03 Martinelli
30/61
23 4.2 Basic model: location and edge potentials
edge potentials interact, resulting in a meaningful segmentation. In this example, we want to
segment the toy image into three different classes: background, foreground-1 and foreground-2.
Figures 4.2b, 4.2d and 4.2f show the unary location potentials λ(y_i, i; θ_λ) for classes foreground-
1, foreground-2 and background, respectively, at every pixel i¹. A white pixel represents a
high probability of a class being present at that pixel (which is equivalent to saying that the
energy potential is low), impelling the function minimization to prefer labelings where the pixels
are white rather than black. Figure 4.2c shows the gradient image, which is a way to visualize
the edge potential calculated as in Eq. 4.3. The segmentation boundaries are more likely to
be located where the edge potential is white. Figure 4.2e shows the final segmentation obtained
through the minimization of Eq. 4.1.
Figure 4.2: (a) Noisy toy image to be segmented. (c) Gradient image as basis for the edge potential. (b, d, f) Location potentials of classes foreground-1, foreground-2 and background, respectively. (e) Final segmentation inferred from the minimization of Eq. 4.1.
Note that the final segmentation correctly ignores the noise, as it is not present at the same
pixels simultaneously in the edge and location potentials. The red and yellow structures inside
the main blob are all segmented as class foreground-1 thanks to the contribution of its location
potential. The constant term in Eq. 4.3, which adds a given cost for any pixel belonging to a
label boundary, helps suppress the appearance of noisy, small foreground regions.
¹ The location potential of the class background is complementary to the foreground classes' potentials. That is, when either class foreground-1 or foreground-2 is likely, class background is unlikely, and vice versa.
4.3 Texture potential model
Although the segmentation of the toy example, obtained with the location and edge potentials
described in the last section, was robust against noise, the location potentials provided were
very similar to the regions we wanted to segment. In real images, not only are the location potentials
less correlated with the position of the labels, but there are also much more complex objects
to be segmented, which cannot be differentiated just by using location and edge potentials. The
next step towards a better segmentation is then modeling the texture information present in the
images. We can represent this new potential by rewriting the energy function U as:

U(y|x, θ) = Σ_i λ(y_i, i; θ_λ) + Σ_i ψ_i(y_i, x; θ_ψ) + Σ_{(i,j)∈ε} φ(y_i, y_j, g_ij(x); θ_φ)   (4.5)
Note that the texture potential represents local texture only, i.e., it does not take into
account context. It is merely a local feature. Context and layout are explored in Section 4.4,
where the use of simplified texture-layout filters is investigated.
In order to represent the texture information of the images to segment, we opted, similarly
to TextonBoost [21], for the use of filter banks. By using an N-dimensional filter bank F,
one obtains an N-dimensional feature vector f_x(i) for each pixel i. Each component of this
vector is the result of the convolution of the input image converted to grayscale, x, with the
corresponding filter, evaluated at the position of i:

f_x(i) = [ (F_1 * x)|_i, (F_2 * x)|_i, ..., (F_N * x)|_i ]^T   (4.6)

Equivalently, the result of the convolution of an N-dimensional filter bank with an image can
be understood by considering the convolution of the image with each component of the filter bank
at a time. Figure 4.3 shows an example input image and the response images for some of the Leung-Malik filter bank components [16].
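The per-pixel feature vector of Eq. 4.6 can be sketched as below (a naive Python illustration, not the MEX implementation we actually use; it computes 'same'-size responses with zero padding, technically correlation rather than convolution, which coincides for symmetric filters, and assumes odd-sized filters):

```python
import numpy as np

def filter_bank_features(gray, filters):
    """Per-pixel feature vector of Eq. 4.6: stack the response of each filter
    of the bank, evaluated at every pixel of the grayscale image."""
    H, W = gray.shape
    out = np.zeros((H, W, len(filters)))
    for n, F in enumerate(filters):
        fh, fw = F.shape
        pad = np.pad(gray, ((fh // 2,) * 2, (fw // 2,) * 2))  # zero padding
        for r in range(H):
            for c in range(W):
                out[r, c, n] = np.sum(pad[r:r + fh, c:c + fw] * F)
    return out
```

In practice the responses are computed with optimized convolution routines; the triple loop here only makes the definition explicit.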
4.3.1 Feature vector and choice of filter bank
The choice of the filter bank used to represent the texture in the images to be segmented was
based on the following criteria:
- Good coverage of possible textures without too much redundancy between filters;
Figure 4.3: (a) Example of an inner-city road scene image. (b-f) Examples of responses of five different filter components of the LM filter bank, which are shown at the bottom left corner of each figure.
- Fast and efficient filter response calculation;
- Ready-to-use implementation available.
Considering those criteria, a very interesting implementation by the Intelligent Systems Lab
of the University of Amsterdam has been found. It is implemented as a Matlab MEX file, which
means it is actually pre-compiled C code that is called by Matlab at execution
time. The libraries are freely available for research purposes².
Using this fast MEX implementation, five different filter banks have been assessed
by segmenting images using only the texture potential in Eq. 4.5. Four classes have been
considered: road, sidewalk, others³ and sky.
The filter banks assessed were the following:
- MR8: The MR8 filter bank consists of 38 filters but only 8 filter responses. The filter bank
contains filters at multiple orientations, but their outputs are collapsed by recording only the maximum filter response across all orientations (see Figure 4.4);
- MR8 - no maxima: The rotation invariance of the MR8 filter bank, achieved by taking
only the maximum response over all orientations, may not be a desired property of a tex-
ture filter bank used for segmentation (some classes could be described by the orientation
of their features). Therefore, a filter bank called MR8 - no maxima has been defined, where all
the 38 responses are kept;
² Source code at: http://www.science.uva.nl/mark.
³ Class others is assigned to any pixel that is not labeled as one of the other three classes; it can thus be seen as the complement of the other three classes.
- MR8 - separate channels: Here, the MR8 filter bank is applied individually to each of the
three color channels, in an attempt to verify whether discriminative texture information
is unevenly distributed over the color channels;
- MR8 - multi-scale: This filter bank is composed of three MR8 filter banks at three
subsequent scales. Although the MR8 filter bank itself already uses filters at different
scales, we found it interesting to try to cover even more scales, as road scenes almost
always contain objects whose distance may vary by many orders of magnitude⁴;
- TextonBoost's filter bank: This filter bank has 17 dimensions and is based on the CIELab
color space. It consists of Gaussians at scales k, 2k and 4k, x and y derivatives of Gaussians
at scales 2k and 4k, and Laplacians of Gaussians at scales k, 2k, 4k and 8k. The Gaussians
are applied to all three color channels, while the other filters are applied only to the
luminance.
Figure 4.4: The MR8 filter bank is low dimensional, rotationally invariant and yet capable of picking out oriented features. Note that only the maximum response of the filters in each of the first 6 rows is taken.
As all the filter banks (except MR8 - separate channels and TextonBoost's filter bank) are
convolved with grayscale images, we also concatenated to the texture feature vector f_x(i)
⁴ For instance, there might be a car immediately in front of the camera but also another one tens of meters away.
(which is the response of the filter bank) the L, a and b color values of its corresponding pixel:

f'_x(i) = [ f_x(i), L_i, a_i, b_i ]^T   (4.7)

In this manner, the color information was merged with the texture, giving an extra cue to
the Adaboost classifiers⁵.
The results of the tests showed that the filter bank that yielded the best segmentation
results, and thus best represented the texture information in the road scene images, was
MR8 - multi-scale. This is probably due to the aforementioned fact that road scene images
contain similar objects and regions that may vary greatly in depth. This variation is well captured
by the multi-scale characteristic of the MR8 - multi-scale filter bank.
Combination of 3D cues to feature vector
As discussed in Section 3.1.2, 3D information can be extracted from the images of a video sequence
using structure-from-motion techniques. Those techniques can only infer the 3D position of
characteristic points in the image, that is, points that can be located, described and then
matched in subsequent images. In this thesis, this has been done using the Harris corner detector, with normalized cross-correlation over patches for matching. Other possible patch descriptors
are, for example, SIFT and SURF.
All the 3D features mentioned in Section 3.1.2 have been concatenated (just like the L, a and
b color values) to the feature vector described in Eq. 4.7:
f''_x(i) = [ f'_x(i), 3Dfeature_1(i), ..., 3Dfeature_5(i) ]^T   (4.8)
However, in order to include these 3D cues in our feature vector, they need to be defined for
every pixel of an input image. That means that we have to transform the sparse 3D features
obtained using reconstruction techniques into dense features. This can be done by interpolation,
where every pixel is assigned 3D feature values based on the values of the sparse neighbors for
which the reconstruction techniques produced estimates.
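A minimal sketch of this sparse-to-dense step, using nearest-neighbor interpolation (names are our own; a smoother scheme, e.g. linear interpolation over a triangulation, could be used instead):

```python
import numpy as np

def densify(points, values, shape):
    """Turn sparse 3D feature values (one per reconstructed point) into a
    dense per-pixel map: each pixel takes the value of its nearest sparse
    point (nearest-neighbor interpolation)."""
    H, W = shape
    rr, cc = np.mgrid[0:H, 0:W]
    grid = np.stack([rr.ravel(), cc.ravel()], 1)     # all pixel coordinates
    d2 = ((grid[:, None, :] - np.asarray(points)[None]) ** 2).sum(-1)
    nearest = d2.argmin(1)                           # index of the closest sparse point
    return np.asarray(values)[nearest].reshape(H, W)
```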
⁵ Tests have been performed with different color spaces, yielding the best results when CIELab was used. This comes from the fact that the CIELab color space is partially invariant to changes in scene lighting: only the L dimension changes, in contrast to all three dimensions of the RGB color space, for instance.
Figure 4.5 shows an example of dense interpolation of the 3D feature height above ground
for an image taken from the CamVid database.
Figure 4.5: (a) A dusk image taken from the CamVid database. (b) The calculated height-above-ground 3D feature. After determining a point cloud with structure-from-motion techniques, the sparse features have been interpolated to yield a dense representation. Notice how the sky has high values, and that we can see a faint blob where the car is located in the original image.
It is important to mention that, before concatenating them to the feature vector as shown
in Eq. 4.8, the 3D features have been appropriately normalized. The normalization guarantees
that they do not overshadow the texture and color features during the clustering process. This
could happen if the values of the 3D features were much greater than the values of the other
features: since the clustering method implemented uses Euclidean distances, such an imbalance
in the feature values would result in biased cluster centers. The influence of the use of 3D
features on the segmentation results is discussed in Chapter 5.
4.3.2 Boosting of feature vectors
Having defined the feature vector as in Eq. 4.8, we then need to find patterns in the features
extracted from training images and try to recognize them in new, unseen images. For instance,
we want to learn which texture, color and 3D cues are typical of each of the classes we want to
segment. Some of the machine learning techniques suitable for this task are neural networks,
belief networks and Gaussian Mixture Models in the N-dimensional space (where N is the number
of filters in the filter bank). Nonetheless, an Adaboost approach has been preferred for its
generalization power and ease of use.
A short overview of the way Adaboost works is given here. For more details about
its implementation and theoretical grounds, please see [8]. For this thesis project we have
Figure 4.6: Example of the training procedure for classifier road. The Q × K data matrix D is represented by the red vectors, whereas the 1 × K label vector L is indicated by the green arrows.
utilized a ready-to-use Matlab implementation from Moscow State University⁶.
Note that, since we are dealing with binary Adaboost classification, a classifier is trained
for each of the classes we want to segment in a one-versus-all manner. For the training of each
classifier, a learning data matrix D ∈ R^(Q×K) is taken as input by the Adaboost trainer. Matrix
D has size Q × K, where Q is the number of dimensions⁷ of the feature vector from Eq. 4.8 and
K is the number of feature vectors used for training (the feature vectors are extracted from
pixels in the training images). Another input, a 1 × K vector L ∈ {0, 1}^(1×K), contains the labels
of the training data D. Vector L is comprised of ones for the pixels belonging to the class of
the classifier being trained, and zeros otherwise. Figure 4.6 illustrates how the individual classifiers
for each class are trained.
The Adaboost classifier of class c is composed of M stump weak classifiers h_c(f),

h_c(f) = { 1, if f_p > θ;  0, otherwise }   (4.9)

where f_p is the p-th dimension of vector f and θ is a threshold. The strong classifier H_classc(f(i))
is built by choosing the most discriminative weak learners, minimizing the error with respect to the target
value, as explained in Section 3.3.2. Figure 4.7 shows how a trained classifier outputs a
confidence value between zero and one for feature vectors from unseen images.
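The evaluation of such a classifier on a new feature vector can be sketched as follows. The (p, θ) pairs and the weights stand in for what the Adaboost training would select; the normalization to [0, 1] reflects the confidence output described above (names are our own illustration):

```python
import numpy as np

def stump(f, p, theta):
    """Weak classifier of Eq. 4.9: fires iff the p-th feature exceeds theta."""
    return 1.0 if f[p] > theta else 0.0

def confidence(f, stumps, alphas):
    """Confidence of the strong classifier for one class: a weighted vote of
    the selected stumps, normalized so the output lies in [0, 1]."""
    votes = sum(a * stump(f, p, t) for a, (p, t) in zip(alphas, stumps))
    return votes / sum(alphas)
```

A confidence near one means the feature vector resembles the positive training examples of the class; near zero, the negative ones.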
Once we have defined a strong classifier H for each class, the texture potential of Eq. 4.5
⁶ Source code available at http://graphics.cs.msu.ru/en/science/research/machinelearning/modestada.
⁷ Q = N (number of dimensions of the filter bank) + 3 (L, a, b) + 5 (3D features).
Figure 4.7: Given a trained classifier, a classification confidence is computed based on how similar the input feature vector is to the positive examples (and on how different it is from the negative ones) provided in the training phase illustrated in Figure 4.6.
can be defined as:
i(yi, x; ) = .Hclassyi (fx(i)) (4.10)
The output of the strong classifier H_{class y_i}(f_x(i)) is multiplied by a negative constant, so that a positive confidence turns into a negative energy, which is preferred in the energy minimization. θ_ψ is the set of all parameters used in the Adaboost training of H, for instance, the number of weak classifiers.
4.3.3 Adaptive training procedure
In order to make the training of Adaboost classifiers more tractable, not every pixel of every
training image has been selected to build the training data matrix D. Since there is a lot
of redundancy between pixels, this simplification has not adversely affected the quality of the
Adaboost classifiers.
Although the selection of pixels for the extraction of training feature vectors was initially random, a smarter and innovative algorithm has been developed.
The adaptive training procedure works by iteratively choosing an unequal proportion of feature vectors from each label. The idea is that, based on the confusion matrix of a given segmentation experiment, we know the strengths and weaknesses of the trained classifiers.
For instance, suppose that in a given segmentation experiment class "sky" is not confused as much as "street" and "sidewalk". Then, it is reasonable to choose, in the next segmentation experiment, more feature vectors from classes "street" and "sidewalk" and fewer feature vectors from class "sky" for the training of classifiers "street" and "sidewalk".
Formally, if we represent the weight (or proportion) of training feature vectors from class
i, used in the Adaboost training of classifier j, as Wij, the update of every weight after each
segmentation iteration (experiment) can be expressed as:
W_ij ← W_ij · e^(η · Cm_ij) / Z,          if i ≠ j
W_ij ← W_ij · e^(η · (1 − Cm_ij)) / Z,    if i = j    (4.11)
where Cm_ij is the element in the i-th row and j-th column of the confusion matrix of the previous segmentation iteration, η is a learning-speed factor and Z is a normalization factor that guarantees that

Σ_i W_ij = 1,    (4.12)
or, in other words, that the sum of the proportions of feature vectors from each class remains
equal to 1. The weights are all equally initialized as Wij = 1/Nc, Nc representing the number
of classes.
Notice that in the case of a perfect segmentation, where the confusion matrix is equal to
the identity matrix, the proportion of training feature vector samples Wij does not change.
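A minimal sketch of this update, assuming the weights are stored as a matrix with one column per classifier and renormalized column-wise; the function and variable names are ours, not the thesis implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Adaptive re-weighting of Eq. 4.11. W[i][j] is the proportion of
// training vectors of class i used for classifier j; Cm is the
// row-normalized confusion matrix of the previous iteration; eta is
// the learning-speed factor. Each column is renormalized so that its
// weights sum to 1 (Eq. 4.12).
void updateWeights(Matrix& W, const Matrix& Cm, double eta) {
    const std::size_t n = W.size();
    for (std::size_t j = 0; j < n; ++j) {
        double Z = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // Off-diagonal confusion raises the weight; a weak diagonal
            // (class j poorly recognized) raises its own weight too.
            const double expo = (i == j) ? (1.0 - Cm[i][j]) : Cm[i][j];
            W[i][j] *= std::exp(eta * expo);
            Z += W[i][j];
        }
        for (std::size_t i = 0; i < n; ++i) W[i][j] /= Z;
    }
}
```

With an identity confusion matrix every exponent is zero, so the weights stay at their initial value of 1/Nc.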
Although the adaptive learning algorithm considerably improved the segmentation quality (see Section 5.1), the use of local features alone is intrinsically limited. As precise and discriminative as a classifier may be, there are cases where class "sidewalk" is virtually identical to class "road" for every local feature imaginable. The natural next step towards a better segmentation is to use context information. Then, the fact that sidewalks normally run alongside roads, separating them from buildings or other regions, can be exploited to help us correctly differentiate what is locally indistinguishable.
4.4 Texture-layout potential model (context)
In order to model contextual information, we opt for the texture-layout features introduced by TextonBoost. This new potential replaces the texture potentials explained in the previous section, as it is more general. We then have the following energy function:
U(y|x, Θ) = Σ_i λ(y_i, i; θ_λ) + Σ_i ψ_i(y_i, x; θ_ψ) + Σ_{(i,j)} φ(y_i, y_j, g_ij(x); θ_φ),    (4.13)

where the three sums are the location, texture-layout and edge potentials, respectively.
In this equation, the texture-layout potentials are defined similarly to the way they are defined in TextonBoost:

ψ_i(y_i, x; θ_ψ) = −α · H(y_i, i)    (4.14)
The confidence H(y_i, i) is the output of a strong classifier found by boosting weak classifiers,

H(y_i, i) = Σ_{m=1}^{M} h_m^{y_i}(i)    (4.15)
Each weak classifier, in turn, is defined based on the response of a texture-layout filter:

h_m^{y_i}(i) = { a, if v_[r,t](i) > θ
              { b, otherwise    (4.16)
Notice the difference from the definition in Eq. 3.20 of TextonBoost: bearing in mind our
final goal of behavior prediction, we do not need to classify as many classes as in TextonBoost
where up to 32 different classes are segmented. TextonBoost shares weak classifiers so that the computation cost becomes sub-linear in the number of classes. Since we do not need as
many classes, it is possible for us to simplify the calculation of strong classifiers by not using
shared weak classifiers. Therefore, in our approach, each strong classifier has its own, exclusive
weak classifiers.
The texture-layout filter response v_[r,t](i) is the proportion of pixels in the input image, among all those lying in the rectangle r with its origin shifted to pixel i, that have been assigned texton t in the textonization process illustrated in Section 3.3.2:

v_[r,t](i) = (1 / area(r)) Σ_{j ∈ (r+i)} [T_j = t].    (4.17)
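Using one integral image per texton index (as discussed in Section 3.3), the filter response of Eq. 4.17 reduces to four lookups per rectangle. A hypothetical sketch, with names of our own choosing:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Per-texton integral image with a one-pixel zero border, so that any
// rectangle sum costs four lookups. Illustrative sketch of Eq. 4.17,
// not the thesis implementation.
struct TextonIntegral {
    int W, H;
    std::vector<int> I;  // (W+1) x (H+1), row-major

    TextonIntegral(const std::vector<int>& textonMap, int w, int h, int t)
        : W(w), H(h), I((w + 1) * (h + 1), 0) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                I[(y + 1) * (W + 1) + (x + 1)] =
                    (textonMap[y * w + x] == t)   // indicator [T_j = t]
                    + I[y * (W + 1) + (x + 1)]    // above
                    + I[(y + 1) * (W + 1) + x]    // left
                    - I[y * (W + 1) + x];         // diagonal (counted twice)
    }

    // Number of texton-t pixels in [x0, x1) x [y0, y1), clipped to the image.
    int count(int x0, int y0, int x1, int y1) const {
        x0 = std::max(x0, 0); y0 = std::max(y0, 0);
        x1 = std::min(x1, W); y1 = std::min(y1, H);
        if (x0 >= x1 || y0 >= y1) return 0;
        return I[y1 * (W + 1) + x1] - I[y0 * (W + 1) + x1]
             - I[y1 * (W + 1) + x0] + I[y0 * (W + 1) + x0];
    }
};

// v_[r,t](i): proportion of texton-t pixels inside rectangle r
// (offset (rx, ry), size rw x rh) shifted to pixel i = (ix, iy).
double filterResponse(const TextonIntegral& I, int ix, int iy,
                      int rx, int ry, int rw, int rh) {
    int x0 = ix + rx, y0 = iy + ry;
    return static_cast<double>(I.count(x0, y0, x0 + rw, y0 + rh)) / (rw * rh);
}
```

Note that the response is normalized by area(r), as in Eq. 4.17, even when the shifted rectangle is clipped at the image border.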
4.4.1 Training procedure
We used, for our textonization process, the same feature vector definition as in Eq. 4.8, which
contains texture, color and 3D cues.
In order to build a strong classifier (note that we need to train one strong classifier for each of the classes we want to segment the image into), weak classifiers are added one by one following this boosting procedure:
1. Generation of weak classifier candidates: Each weak classifier is composed of a texture-layout filter (r, t) and a threshold θ. The candidates are generated by randomly choosing a rectangular region r inside a bounding box, a texton index t ∈ T = {1, 2, ..., K}, where K is the number of clusters used in the textonization process, and finally a threshold θ between 0 and 1. For the addition of each weak classifier, an arbitrary number of candidates, Ncd, is generated.
2. Calculation of parameters a and b for all candidates: Each weak classifier candidate must also be assigned values a and b so that its response, h_m^c(i), is fully defined (see Eq. 4.16). As described by Torralba et al. [23], who use the same boosting approach (except that ours does not share weak classifiers), a and b can be calculated as follows:

b = ( Σ_i w_i^c z_i^c [v_[r,t](i) ≤ θ] ) / ( Σ_i w_i^c [v_[r,t](i) ≤ θ] ),    (4.18)

a = ( Σ_i w_i^c z_i^c [v_[r,t](i) > θ] ) / ( Σ_i w_i^c [v_[r,t](i) > θ] ),    (4.19)
where c is the label for which the classifier is being trained, z_i^c = +1 or z_i^c = −1 for pixels i which, respectively, have ground truth label c or a label different from c, and w_i^c are the classification accuracy weights used by Adaboost (see Section 3.3.2).
Note, from Eq. 4.18 and Eq. 4.19, that, for the calculation of a and b, the response of the texture-layout filters, v_[r,t](i), must be calculated for all training pixels i and compared to the threshold θ.
3. Search for the best weak classifier candidate: Once each weak classifier is fully defined, that is, all parameters (r, t, θ, a, b) are set, the most discriminative among the candidates is found by minimizing the error function with respect to the target values z_i^c.
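Step 2 above, the weighted fit of a and b for one candidate, can be sketched as follows. This follows the Torralba et al. regression-stump fit rather than the thesis code verbatim, and the names are ours:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted fit of the stump outputs a and b (Eq. 4.18 and Eq. 4.19):
// given the filter responses v_i, targets z_i in {-1, +1} and Adaboost
// weights w_i, each output is the weighted mean of the targets on its
// side of the threshold theta.
void fitStumpOutputs(const std::vector<double>& v,
                     const std::vector<double>& z,
                     const std::vector<double>& w,
                     double theta, double& a, double& b) {
    double numA = 0.0, denA = 0.0, numB = 0.0, denB = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > theta) { numA += w[i] * z[i]; denA += w[i]; }  // Eq. 4.19
        else              { numB += w[i] * z[i]; denB += w[i]; }  // Eq. 4.18
    }
    a = denA > 0.0 ? numA / denA : 0.0;
    b = denB > 0.0 ? numB / denB : 0.0;
}
```

For a candidate whose threshold cleanly separates the positive from the negative examples, the fit returns a = +1 and b = −1.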
In Chapter 5 we see how texture-layout strong classifiers can learn the context between
objects. We observe also how the number of weak classifiers influences the segmentation quality.
4.4.2 Practical considerations
System architecture
Due to the short period of time available for this thesis work, the implementation of the software had to be efficient and fast. Owing to its flexibility, and the variety of ready-to-use image processing, statistics, plotting and other functions available, Matlab was the preferred tool for the implementation of the solution.
Conditional Random Fields are, however, intrinsically highly demanding in computational
resources. This is due to the iterative nature of the minimization procedure of the cost function
U, detailed in Section 3.2.1. As Matlab is an interpreted programming language, it is significantly slower at processing loops than compiled languages such as C or C++. Therefore, Matlab
has proven to be unable to cope with the massive calculations needed for the segmentation
inference, when the cost function U is minimized.
Figure 4.8: Software architecture. The Matlab layer is responsible for the higher-level processing, whereas the C++ layer performs the heavy energy minimization computation.
In the context of the iCub project [12], which is led by the RobotCub Consortium, consisting of several European universities, a good C++ framework for the minimization of Markov Random Field energy functions was found. The main goal of the iCub platform is to study cognition through the implementation of biologically motivated algorithms. The project is open source: both the hardware design and the software are freely available.
The implemented software was then based on a two-layer layout, as illustrated in Figure 4.8. Matlab, on the higher level, pre-processes the images (calculating, for instance, filter convolutions), whereas the C++ program computes the minimum of the energy function U. In other words, the C++ layer infers, from the given clique potentials and the Matlab pre-processed input data, the maximum a posteriori (MAP) labeling.
The assessment of the quality of the segmentations, the storage of results and all comple-
mentary software functionalities are handled by Matlab on the higher-level layer.
Implementation challenges and optimizations
Unlike the case of the texture potential explained in the previous section, we could not find any ready-to-use Matlab implementation of the boosting procedure for the texture-layout potential, as it is very specific to this problem. The whole algorithm therefore had to be implemented from scratch. Moreover, since there are countless loops involved in the training algorithm described above, Matlab was ruled out as the programming environment for the implementation and replaced by C++.
Two main practical problems were faced in the C++ programming of the algorithm described above: firstly, the long processing time and, secondly, the lack of RAM memory.
1. Processing time: The boosting procedure described in the previous section requires computations over all training pixels. If we consider 100 images (a typical number for a training data set), each composed of, for instance, 800 × 600 pixels, we already have 48 million calculation loops for each step. This turns out to be impractical for today's processors. The solution found was to resize all dataset images before segmenting them and also to consider, as training pixels, only a subsampled set of each image. By resizing the images to half their original size and subsampling the training pixels in 5-pixel steps, we could reduce the number of calculation loops by a factor of 100. After this simplification was applied, the decrease in segmentation quality was almost imperceptible, which
indicates that the information necessary for training the classifiers was not lost with the
resizing and subsampling.
2. RAM memory:
As discussed in section 3.3, the use of integral images is essential for the efficiency of the
calculation of the texture-layout filters v[r,t]. If we consider that 100 textons have been
defined in the textonization process, we have, for each training image, 100 integral images,
one for each texton index. Again, considering 100 training images already resized to half their original size, we have ten thousand 400 × 300 matrices (each matrix represents an integral image). If we use a normal int variable for each matrix element, which in C++ occupies 4 bytes, we need 10000 × 400 × 300 × 4 bytes = 4.8 gigabytes of RAM memory.
The first attempt to avoid this memory problem was to load only some of the integral
images at a time. However, for the calculation of the texture-layout filter responses of the
weak classifier candidates, all the integral matrices are necessary. They therefore had to be simultaneously accessible in RAM.
The solution was to use short unsigned integers (only 2 bytes), which were big
enough for all the integral matrices analyzed8, and also to subsample the integral image
matrices:
Î^(t)(u, v) = I^(t)( round( (u, v) / SubsamplingFactor ) )    (4.20)
Again, the subsampling almost did not change the results of the final segmentations. One
of the reasons why the results did not change much is probably that the subsampling
rate of 3 used is much smaller than the sizes of the rectangular regions r used in the
texture-layout features. Although the subsampling reduced the amount of RAM memory necessary for loading the integral images, there is still a limit on the number of training images that can be used without causing memory problems.
8 Each short unsigned integer can store a number of up to 65535. If we consider a 400 × 300 pixel image, the maximum value of an integral image (if all pixels were assigned to one single texton) is 120000. However, since each pixel is assigned to one of many texton indexes, the integral image of each texton never has values close to the limit of 65535.
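The lookup of Eq. 4.20 on a 2-byte, subsampled integral image might look as follows; the struct layout and the clamping at the border are our assumptions, not the thesis code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Subsampled integral-image lookup of Eq. 4.20: a full-resolution
// coordinate (u, v) is mapped to the nearest stored sample. Storing
// uint16_t halves the memory of a 4-byte int per entry; footnote 8
// argues the per-texton counts stay well below 65535.
struct SubsampledIntegral {
    int w, h;                      // stored (subsampled) dimensions
    int factor;                    // SubsamplingFactor
    std::vector<std::uint16_t> I;  // row-major, w x h

    std::uint16_t at(int u, int v) const {
        int su = static_cast<int>(std::lround(static_cast<double>(u) / factor));
        int sv = static_cast<int>(std::lround(static_cast<double>(v) / factor));
        if (su >= w) su = w - 1;   // clamp rounding overshoot at the border
        if (sv >= h) sv = h - 1;
        return I[sv * w + su];
    }
};
```

With a subsampling factor of 3, each 400 × 300 matrix shrinks to roughly 134 × 100 samples, so ten thousand of them in uint16_t occupy on the order of 0.27 gigabytes instead of 4.8.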
Chapter 5
Results
In this chapter, we investigate the performance of our semantic segmentation system on the
challenging CamVid dataset and compare our results with existing work. Firstly, we show
preliminary results obtained with the texture features described in Section 4.3, without considering any context. We then analyse our final model with the context features (texture-layout
features) described in Section 4.4. The effect of different aspects and parameters of the model
is discussed before we present the best results obtained and analyse them quantitatively and
qualitatively.
5.1 Model without context features
Figure 5.1 shows the confusion matrix of the segmentation of approximately 200 pictures, with
classifiers trained on 140 other pictures, all randomly taken from the CamVid database. For this
segmentation experiment, 500 training feature vectors have been randomly chosen per training
image. The segmentations have been computed by minimization of Eq. 4.5 which does not
include any context feature. Notice how sidewalks are almost not recognized at all.
The adaptive training procedure described in Section 4.3.3 chooses, for the training of the Adaboost classifiers, more examples of feature vectors from labels that are confused, like "road" and "sidewalk", than from those that are easily recognized, like "sky". The confusion matrix of Figure 5.1 shows the results of the segmentation of the first iteration of this adaptive Adaboost training algorithm, where all training vectors are chosen randomly. After three iterations,
examples are selectively chosen and the confusion matrix of the segmentation results, shown in
Figure 5.2, shows much better discernment between classes that were initially mixed up.
Although the adaptive training procedure improved the segmentation quality, context in-
formation, as discussed in the next section, contributes to differentiate classes even better.
Figure 5.1: Confusion matrix of a segmentation experiment choosing random feature vectors for training the Adaboost classifiers. Each row shows what proportion of the ground truth classes has been assigned to each class by the classifiers. Class "others" is the union of all classes defined in the CamVid database except "street", "sidewalk" and "sky". For an ideal segmentation, the confusion matrix would be equal to the identity matrix.
Figure 5.2: Confusion matrix of the segmentation after three iterations of the adaptive training. Initially, 65% of class "sidewalk" was wrongly assigned to class "road", as compared to only 25% with the adaptive learning. The percentage of class "sidewalk" correctly assigned also increased from 9% to 61%.
5.2 Model with context features
Our final model includes the texture-layout potential (see Section 4.4). This model and its
results are discussed in detail in the following sections.
5.2.1 Influence of number of weak classifiers
As illustrated in Figure 3.8, texture-layout filters work by exploring the contextual correlation of textures (and, in our solution, also color) between neighboring regions. Figure 5.3 shows the rectangular region r of each of the first ten texture-layout features for the classifier of class "road". Notice that the location distribution of the regions r is slightly biased towards either the top half or the bottom half of the image. This probably comes from the fact that most of the correlations between textures present in class "road" and other textures happen in a vertical fashion: the road is normally below other classes.
Figure 5.3: r regions of the first ten weak classifiers composing the strong classifier for class "road". The yellow cross in the middle indicates the pixel i being classified and the blue rectangle represents the bounding box within which all the weak classifier candidates are created. The bigger the