Visit to GRASP Lab: 2009

Tuesday, November 24, 2009

Vanishing Point Computation

1. The vanishing point computation is performed using a vote scheme from all the lines detected in the scene. As an example we can observe the following image, where the dominant directions have been computed.

2. From the vanishing points the rotation matrix of the camera withrespect to the scene is computed

R = (VpX, VpY,VpZ) ^T

3. Then we use this matrix to rectify the images undoing the rotation R. For example we can see next images where the one on the left is the original image and the right one is the rectified one.

4. We were able to compute the VPs trough an indoors sequence. This sequence has 2200 frames. The rotation angles obtained from the rotation matrix are the following

5. From the last graphic we can observe that the camera is performing a planar motion and the x-axis gives us the orientation. The drastic changes in x-axis indicates a change of direction. This happens when the camera rotates around a corner (~90º). This situation can be observed here

Frame 685

Frame 686

6. Analyzing the rotations about the x-axis (alpha) we are able to detect the changes of orientation and with this information we can build a very general topological map based on rotations.

7. To avoid having all the frames and their respective rectification matrices we need to select some frames, we call these frames "keyframes". These keyframes will represent the nodes of our topological map. The keyframes are selected by computing the Essential matrix between two frames. To do this we use the 5-point algorithm

8. From the estimated Essential matrix we extract the rotation matrix and we compare it to the one extracted from the vanishing points.

Rxy_vanishing = R{frame_x}' * R{frame_y}

Rextracted = extract_R_from_E(E)

9. If the rotation matrix obained from the Essential matrixis congruent with the one computed from the visual compass we select another frame to compute a new Essential matrix. When we can't go on computing Essential matrices we select those frames as keyframe and start the process again from the begining. The output of this process will be the topological map.

Tuesday, April 7, 2009

Automatic Recovery of Relative Camera Rotations for Urban Scenes

Main idea: Based on the computation of the vanishing points (VP) and their matching through different images coming from different cameras, (hemispherically-tiled set of images captured from a single position in space) the computation of the relative camera rotation is performed.

This work first compute the vanishing points present in each image based on the direction of the edges extracted from the scene. The representation used for these edges is the intersection of the plane through the edge and the focal point with the Gaussian sphere. The direction of the 3D edges (vanishing points) is computed using an EM approach.

The vanishing point estimation is composed of two tightly-coupled sub-problems: classification (grouping observed edges into parallel sets) and estimation (finding the best VP for each set). The authors present an hybrid approach to VP detection and estimation which combines the Hough transform (HT) for detection with the acccuracy of least squares for estimation. An EM algorithm is formulated to probabilistically model edges and their directional uncertainty. A final verification step rejects spurious directions.

link: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=854809

Monday, April 6, 2009

Video Compass

Main idea: Extract the relative orientation of the camera with respect to the scene. This approach is based on the properties of the vanishing points. The vanishing points are computed from the intersection of parallel lines in uncalibrated images. It assumes that the majority of lines in man-made environments is alligned with the principal orthogonal directions of the world coordinate frame.

This approach can be summarized as follows:

Line Detection.
Lines are grouped into the dominant vanishing directions. In man-made environments are alligned with the principal orthogonal axes of the world reference frame.
Computation of the vanishing points

Two cases: calibrated and uncalibrated camera. This work assumes the uncalibrated case.

The grouping and the estimation of the vanishing points is performed at the same time using an EM approach.

The relative orientation of the camera wrt the scene is computed using the vanishing point constraints.

Calibration

The relation between image coordinates of a point and its counterpart in the calibrated and uncalibrated cases

λx = RX+ T and λx' = KRX+ KT

x' denotes a pixel coordinate of X and K is the intrinsic parameter matrix of the camera. The unit vectors of the world coordinate frame are e_i = [1,0,0]', e_j = [0,1,0]' and e_k = [0,0,1]' . The vanishing points corresponding to 3D lines parallel to either if these directions are

v_i = KRe_i
v_j = KRe_j
v_k = KRe_k

the orthogonallity relation between e_i, e_j, e_k readily provide constraints on the calibration matrix K. In particular

e'_i e_j = 0 -> v'_i K^(-T)RR^(-T)K¯¹v_j = v'_iK^(-T)K¯¹v_j = v'_i Sv_j = 0

where S is the metric associated with the uncalibrated camera S = K^(-T)K^(-1).
When three different vanishing points are detected, they provide three independient constraints on matrix S
v_i S v_j = 0, v'_i S v_k = 0 v'_j S v_k = 0
The zero skew constraint expresses the fact that the image axes are orthogonal can be written as [1, 0, 0]' S [0, 1, 0] = 0

with these constraints the matrix K can be computed by minimizing ||Bs||²

Relative orientation

since the vanishing directions are projections of the vectors associated with three orthogonal directions i, j, k, they depend on rotation only. In particular we can write that:

K¯¹v_i = Re_i K¯¹v_j = Re_j K¯¹v_k = Re_k

with each vanishing direction being proportional to the column of the rotation matrix R = [r_1, r_2, r_3]. Choosing the two best vanishing directions, properly normalizing them, the third row can be obtained as r_3 = cross(r_1,r_2) by enforcing the orthogonality constraints. There is a four way ambiguity in R due to the sign ambiguity in r1 and r2. Additional solutions can be eliminated by considering relative orientation or structure constraints.

Planar Ego-Motion Without Correspondences

Motion estimation as Hough

Camera viewing points P_i vectors relative to the camera's reference system.
P_i's are projected to p_i = P_i/||P_i || by a spherical projective projection
The camera undergoes a rigid motion in the X-Y plane

This motion is given by (R,t) where R = R_z(α) and t, a vector in theX-Y plane

t = R_z(θ)·e_1

From the camera's new coordinate frame, the world points are given as Q_i = RP_i + t and the world points Q_i map to image points q_i = Q_i/||Q_i ||
We infer that Rp_i, q_i and t lie on the same plane X-Y. This is expressed as

(Rz(α)pi × qi)'Rz(θ) e1 = 0.

Extension to any plane given by the angles β and γ.

The vector Rz(γ)Ry(β)e_3 is orthogonal to all vectors lying in the plane.

We can define any plane of motions with rotations.

Motions on the plane (β, γ) are simply a coordinate frame rotation from motions in the X-Y plane.
If the point pairs (p_i, q_i) satisfy a motion in the plane (β, γ), then the point pairs (p'_i, q'_i ), where {p'_i, q'_i } = Ry(−β)Rz(−γ){p_i, q_i} satisfie the same motion in the X-Y plane.

Epipolar constraint for general planar motions can be expressed as

Rz((Rz(α)(Rz(γ)Ry(β)' p_i × (Rz(γ)Ry(β))' q_i)'Rz(θ) e1 = 0

Tuesday, February 24, 2009

Analysis of methods

We can use as base the approach at GRASP lab:

It uses a geometric approach (epipolar geometry) 5 and 3-point algorithms combined with a Preemptive RANSAC to compute accurately the position of the camera.
It also reconstruct the scene from 3D-2D correspondences.
Based on that is able to perform localization and mapping.

Loop-closing is performed using a bag-of-words-based approach.

GPS data is used to restrict the search area. (This is not possible in indoors).

It also uses epipolar geometry to avoid wrong matches.

To apply this approach in a topological layer we should:

Define the nodes in the topological map. These can be represented as:

Fingerprints (Tapus 2005)
Epitomes (Ni 2008)
unsupervised learning approach used to create clusters based on the information present in the acquired images. (Kuipers 2002, Bowling 2006)

Define the relation between these nodes:

Metric information (distance, angle)
Conectivity to the neighbors.

Define the actions to be performed by the robot. These actions can be:

Explore new space
Go to the center of free space
Leave room
Follow the mid-line
Etc

Some works that use a topological approach to perform appearance-based SLAM are the following:

Appearance-Based Topological Bayesian Inference for Loop-Closing Detection in a Cross-Country Environment (Chen and Wang 2006).

They use PCA to model the environment's appearance.
Their distributions are approximated by a series of Gaussian models.

FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance

It is based on the bag-of-words retrieval systems (Níster06).
It can identify if a new observation is already included in the map or if it comes from a previously unseen place by using a probabilistic framework (loop-closing).
It uses SIFT/SURF detector/descriptor.

See the references in the blog.

Proposal:

FAB-MAP can be extended adding metrical information provided by epipolar geometry and scene reconstruction.

Appearance-Based Topological Bayesian Inference for Loop-Closing Detection in a Cross-Country Environment

A visual bag of words method for interactive qualitative localization and mapping

Main idea: The bag-of-words approach is adapted to perform visual localization. It is based in the creation of a dictionary from the words (features) acquired from images. Then the localization is performed by comparing the words from a new scene with the ones stored in the dictionary.

Visual localization

The goal of the classifier is to infer the room from an image.
The classifier can be trained incrementally by a vote scheme.
Active learning is performed when then classifier fails and the user has to provide the right label for the place that was not well classified.

more detailed (assume the map is already built)

The features are extracted and the corresponding words are found in the dictionary. These words then vote at a first level for the rooms in which they have been perceived.
The vote result is calculated by the difference between the maximum and the second maximum.
The winning category votes at a second level.
This process is repeated with the other feature spaces and with new images until the quality of the second level vote reaches a given threshold.

Mapping is performed in two processes

Building the dictionary
Gathering data for the classifier
User-aided approach
Memorize in which category a given word has been perceived

Features used

SIFT
local color histograms
local normalized grey level histograms

Monday, February 23, 2009

Epitomic Location Recognition

Main idea: It uses a generative model based on epitomic image analysis. This analysis is based in a probabilistic approach.

The appearance and geometric structure of the environment is captured into this representation.
It has the ability to model translation and scale invariance together with the fusion of diverse visual features yield enhanced generalization with economical training.
The recognition of a location class is achieved by convolving the query image and the learned epitome.
It doesnt estimate the accurately the camera position.
Occlusions, reflections or non-rigid motions are modeled as noise whose variance changes for different regions within the environment.
These epitomes are generative, probabilistic models and various sources of uncertainty are captured in the variance maps.
In this model, an image is extracted from a larger latent image, the epitome, at a location given by a discret mapping.
Every N x M image I is generated from a Ne x Me location epitome e.
In this approach Ne x Me translations and 3 scales are considered.

Inference and learning

Every image is independient and identically distributed given the epitome.
The goal is to find a single epitome e* which maximizes the probability of the observations. This is achivied by EM algorithm.

Visual features

raw RGB pixels
gist features
disparity maps
local histograms

It uses a stereo camera.

Experiments On Visual Loop Closing Using Vocabulary Trees

Main idea: Several vocabulary trees as GPS data are used to deal with the problem of visual loop closing for long trajectories in an urban environment. The loop closing is performed by searching in the tree the most similar frame to the current one. Epipolar geometry is used to prune out the bad matches.

To deal with the problem of weak definitions of loop-closing a latency test is performed. This consist on testing only images that have not been taken in the last n seconds but that have been taken before that from a viewpoint in a threshold radius. Only the matched images that hold the geometric constraint are considered.

Two types of trees are tested in this work:

Hierarchical K-means
Extremely Randomized Forests (ERT)

Summarizing this approach:

A vocabulary tree is built off-line
Each image of the sequence is added to the vocabulary tree using inverted files.
When the vehicle is close to a previously seen scene, the descriptors of the current image and the inverted files are used to obtain the N closest images.
Epipolar geometry is used to determine if the matches are good.

Monocular Visual Odometry in Urban Environments Using an Omnidirectional Camera

Main idea: Performing visual SLAM using Preemptive RANSAC with the 5-point and 2-point algorithm to compute the position accurately. It uses SIFT correspondeces to compute the epipolar geometry from which the translation and rotation are obtained in a decoupled way.

Landmark detection and tracking. This approach use detector descriptor and matching for SIFT points with a slight modification on some of its thresholds. Matching is divided into two steps computing the epipolar geometry with different distances (20 and 2 pixels ) to the epipolar line.

After the matching between two frames is computed a triangulation of the landmarks is performed to get and initial reconstruction of the scene.

The trajectory estimation relies on the pose estimation process. It has as input a set of 3D-2D correspondences and the correspondeces between the two frames. In this approach the orientation estimation is decoupled from the position estimation.

Epipolar geometry provides the relative position of the new camera up to scale. Only the direction of the translation is estimated. Only one 3D point is required to recover that scale. In doing so, the estimated camera position is consistent with both 3D points and epipolar geometry. Two 3D-2D correspondences are used to estimate the full camera position while fixing its orientation (not only the scale). Preemptive RANSAC is performed followed by iterative refinement.

Triangulation. In this step the orientation is estimated by taking a small set of high quality points. If only 2 landmarks are available close-form, two view algorithm is performed. Otherwise a DLT algorithm for n-views is performed.

FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance

Main idea: SLAM in the space appearance using a probabilistic approach. It is based on the bag-of-words retrieval systems (Níster06). It learns a generative model of the bag-of-words data, i.e. It can identify if a new observation is already included in the map or if it comes from a previously unseen place by using a probabilistic framework.

This approach uses tools as:

Chow Liu trees which aproximates large discrete distributions. This tree is computed off-line.
A semi-external spanning tree algorithm is used to deal with the size of the Chow Liu tree

Probabilistic Navigation using Appearance

The world is modeled as a set of discrete locations.
Each location is described by a distribution over appearance words.
Incoming sensory data is converted into a bag-of-words representation and compared with the stored information.
There is an expresion for the probability that this observation comes either from the location's distributon or from a place which is not in the map.
This approach uses binary features based on SURF detector/descriptor.

Approximations and Parameters

Sets of observations are conditionally independient given position.
Detector behavior is idependient of position.
Location models are generated independently by the environment.
Observations of one feature don't inform about the existence of other features.

Friday, February 20, 2009

Topological SLAM (Tapus, Siegwart, 2005)

Main idea: This approach proposes the concept of fingerprints of places to localize the robot and to build a topological map of the environment with some geometric information. It uses Partially Observable Markov Decision Processes (POMDP).

Fingerprint of places: is a circular list of features, where the ordering of the set matches the relative ordering of the features around the robot.

Uncertainty of features. The probability of a feature being present in the environment when robot perceives it.

Bayesian approach for localization with fingerprints of places:

Supervised learning. Robot explores several locations and stores the fingerprints of the visited places in a database with the name of the place.
Application. The robot localizes itself in the environment by acquiring a fingerprint and comparing it with the fingerprints stored. It uses a Bayesian fingerprint matching.

Topological mapping:

The robot first creates and then updates the global topological map.
A new node is introduced into the topological map whenever the dissimilarity of the newly perceived fingerprint is larger than a threshold.
Each node is composed of set of similar fingerprints of places using a mean fingerprint.

Topological Navigation

A Partially Observable Markov Decision Process (POMDP) model is used.
POMDP integrates both the robot's motion and exetroceptive sensor report data to estimate the pose distribution.

POMDP is defined as

S = set of environment states
A = set of actions
T(s,a,s') = transition function between environment states based on the performed action
O = possible observations.

Indoor Control Strategy

The entropy of the probability distribution over the states of the topological map is used.
When the robot is confident, the action that is optimal to that state is executed.
When the robot is not confident about its state, the robot uses

mid-line following if the previous action was mid-line following

leave the room if the prevoius action was go to the center of the free space

Map Update

The strategy for updating the map is:

When the entropy of the belief state is low enough, the map will be updatedand so the fingerprint and the uncertainty of the features will also be updated.
If the entropy is above the threshold, then updating will not be allowed,and the robot will try to reduce the entropy by continuing navigation with localization.

Closing the loop

Non-explicit loop-closing algorithm. Based on the information provided by the POMDP when two distribution probabilities are observed a loop is identified.

A Distributed Cognitive Map for Spatial Navigation Based on Graphically Organized Place Agents

Main idea: To build a cognitive map from representations given by independient Place Agents. There no exist a global map but local representations given by Place Agents. The union of all local representations give the global map.

Vision system: Stereo camera.

Event representation:

1D - data representing the orientation or movement direction of the robot.
2D - log-polar grid where each bin represents distances to the robot. The data represented there are the distance and the angle. The closest bins represent a more precise space measurement (5cm) while the ones in the periphery area represent bigger distances (5m).
3D - this representation is a combination of 2D and 1D representations. So it represents 1D orientation and 2D position.

Depending on the type of the sensor the event reported by it.

Obstruction sensors = Laser-Range-Finder, Infra-Red-Transceivers
Object-Detection- and Object-Classification Sensors = Vision
Heading-Direction Sensors = Compass, Gyroscope
Ego Motion = Wheel Integration, Optic Flow
Abstract events = Neighbors, Explored Space

For each sensor they use a different local grid. These grids are stored in containers O and can be addressed at once or individually.

Incoming events are separated and multiple slices of representations are kept, separated by time, distance traveled and angle. To have a full description of the environment all these slices are merged in a place-signature when a comparison with an already visited place is required.

The comparison between two containers O1 and O2 is performed by convolution, i.e. O2 is rotated and translated and compared to O1 until the offset between the two containers is minimum. This comparison is performed with the three types of events representations.

Functionality of this approach (a few examples):

add(single slice)
add(other O)
enterNewSensorData(id,data)
getPlaceSignature()
setDecay(delta_t,delta_d)
distance(other O)
findFreeDrivingAngle()
findBestPlace()
getNeighborList()

Driving control. The robot is able to perform actions such as "target in 2m distance at 30 degrees angle".

Local Obstacle Avoidance. The robot permanently monitors its internal representation of the current local environment and compute a detour trajectory if possible trajectories found by a breadth-first search on a grid of hypothetical free positions in increasing steps from robot's the current position.

Path Integration. The robot continually updates a 3-dimensional path-integration vector representing distance, direction and orientation traveled. This vector can be modified by the Place Agents (PA). The robot provides this information to the currently active PA, which might use it to estimate distances to its neighbors.

Note: It uses artificial visual landmarks.

Visit to GRASP Lab