Tuesday, November 24, 2009

Vanishing Point Computation

1. The vanishing point computation is performed using a voting scheme over all the lines detected in the scene. As an example, consider the following image, where the dominant directions have been computed.
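A minimal Python sketch of such a voting step (the thresholds and helper names are my own choices for illustration, not the exact implementation used here):

import cv2
import numpy as np

def detect_lines(gray):
    """Detect line segments with a probabilistic Hough transform."""
    edges = cv2.Canny(gray, 50, 150)
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                           minLineLength=40, maxLineGap=5)
    return np.empty((0, 4)) if segs is None else segs[:, 0]  # rows (x1, y1, x2, y2)

def vote_vanishing_point(segments, n_candidates=2000, tol=2.0):
    """RANSAC-style vote: intersect random pairs of segments (in homogeneous
    coordinates) and keep the intersection supported by most lines."""
    if len(segments) < 2:
        return None, 0
    rng = np.random.default_rng(0)
    # Homogeneous line through each segment: l = p1 x p2, normalized so that
    # |l . x| is the point-line distance for x = (u, v, 1).
    p1 = np.column_stack([segments[:, :2], np.ones(len(segments))])
    p2 = np.column_stack([segments[:, 2:], np.ones(len(segments))])
    lines = np.cross(p1, p2)
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    best_vp, best_votes = None, -1
    for _ in range(n_candidates):
        i, j = rng.choice(len(lines), size=2, replace=False)
        vp = np.cross(lines[i], lines[j])          # candidate intersection
        if abs(vp[2]) < 1e-9:
            continue                               # nearly parallel in the image
        dist = np.abs(lines @ vp) / abs(vp[2])     # distance of each line to vp
        votes = int((dist < tol).sum())
        if votes > best_votes:
            best_vp, best_votes = vp / vp[2], votes
    return best_vp, best_votes

Running this once, removing the supporting lines and repeating gives the remaining dominant directions.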


2. From the vanishing points, the rotation matrix of the camera with respect to the scene is computed:

R = (VpX, VpY, VpZ)^T

3. We then use this matrix to rectify the images by undoing the rotation R. In the next pair of images, the one on the left is the original image and the one on the right is the rectified one.
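A minimal sketch of the rectification, assuming the intrinsic matrix K is known: a point imaged at x = KRX moves to KX under the infinite homography H = K R' K^(-1).

import cv2
import numpy as np

def rectify_image(img, K, R):
    """Undo the camera rotation R (estimated from the vanishing points)
    by warping with the infinite homography H = K R^T K^-1."""
    H = K @ R.T @ np.linalg.inv(K)
    h, w = img.shape[:2]
    return cv2.warpPerspective(img, H, (w, h))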


4. We were able to compute the VPs through an indoor sequence of 2200 frames. The rotation angles obtained from the rotation matrix are the following:



5. From the last plot we can observe that the camera performs a planar motion and that the x-axis gives us the orientation. Drastic changes in the x-axis angle indicate a change of direction; this happens when the camera turns around a corner (~90º). This situation can be observed here:

Frame 685
Frame 686


6. By analyzing the rotations about the x-axis (alpha) we are able to detect these changes of orientation, and with this information we can build a very general topological map based on rotations.
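A small sketch of the turn detection on the angle sequence (the 60-degree threshold is an arbitrary choice for illustration):

import numpy as np

def detect_turns(alpha_deg, turn_thresh=60.0):
    """Flag frames where the orientation angle jumps, e.g. the ~90 degree
    change observed when the camera rounds a corner."""
    d = np.diff(alpha_deg)
    d = (d + 180.0) % 360.0 - 180.0      # wrap differences into (-180, 180]
    return np.flatnonzero(np.abs(d) > turn_thresh) + 1

Each detected turn separates two stretches of roughly constant orientation, which become the nodes of the rotation-based topological map.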




7. To avoid storing all the frames and their respective rectification matrices, we select a subset of frames that we call "keyframes". These keyframes will represent the nodes of our topological map. The keyframes are selected by computing the Essential matrix between two frames with the 5-point algorithm.
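OpenCV's findEssentialMat implements exactly this 5-point solver (Nistér's algorithm) inside a RANSAC loop; a sketch assuming matched pixel coordinates between the two frames are available:

import cv2
import numpy as np

def essential_between(pts_x, pts_y, K):
    """Estimate the Essential matrix between two frames from Nx2 arrays of
    matched pixel coordinates, using the 5-point algorithm with RANSAC."""
    E, inliers = cv2.findEssentialMat(pts_x, pts_y, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    return E, inliers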



8. From the estimated Essential matrix we extract the rotation matrix and compare it to the one obtained from the vanishing points:


Rxy_vanishing = R{frame_x}' * R{frame_y}

Rextracted = extract_R_from_E(E)
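A sketch of this extraction and comparison (cv2.recoverPose resolves the four-fold decomposition ambiguity of E with a cheirality check; measuring congruence as the angle of the residual rotation is my choice of metric):

import cv2
import numpy as np

def rotation_from_E(E, pts_x, pts_y, K):
    """Extract the relative rotation from the Essential matrix."""
    _, R, _, _ = cv2.recoverPose(E, pts_x, pts_y, K)
    return R

def rotation_angle_deg(R1, R2):
    """Angle (degrees) of the residual rotation R1' R2; small means the
    two estimates are congruent."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))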


9. If the rotation matrix obtained from the Essential matrix is congruent with the one computed from the visual compass, we select another frame and compute a new Essential matrix. When we can no longer compute consistent Essential matrices, we select those frames as keyframes and start the process again from the beginning. The output of this process is the topological map.
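A rough sketch of this loop, reusing the helpers above (match_features and the 5-degree congruence threshold are placeholders of my own, not values from the actual system):

def select_keyframes(frames, R_vp, K, max_deg=5.0):
    """Advance while the E-based rotation agrees with the visual compass;
    when estimation breaks down, start a new keyframe (map node)."""
    keyframes, x = [0], 0
    for y in range(1, len(frames)):
        pts_x, pts_y = match_features(frames[x], frames[y])  # hypothetical helper
        E, _ = essential_between(pts_x, pts_y, K)
        Rxy_vanishing = R_vp[x].T @ R_vp[y]
        if E is None or rotation_angle_deg(
                rotation_from_E(E, pts_x, pts_y, K), Rxy_vanishing) > max_deg:
            keyframes.append(y)      # can't go on: restart from this frame
            x = y
    return keyframes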



Tuesday, April 7, 2009

Automatic Recovery of Relative Camera Rotations for Urban Scenes

Main idea: The relative camera rotation is computed from the vanishing points (VPs) and their matching across images coming from different cameras (a hemispherically-tiled set of images captured from a single position in space).

This work first computes the vanishing points present in each image from the directions of the edges extracted from the scene. Each edge is represented by the intersection with the Gaussian sphere of the plane through the edge and the focal point. The directions of the 3D edges (vanishing points) are computed using an EM approach.
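In code, this Gaussian-sphere representation of an edge is just the unit normal of the plane spanned by the viewing rays of its endpoints (a sketch, assuming a known K):

import numpy as np

def edge_to_great_circle_normal(p1, p2, K):
    """Map an image edge (endpoints p1, p2 in pixels) to the unit normal of
    the plane through the edge and the focal point. Its intersection with
    the Gaussian sphere is a great circle, and a vanishing direction v of a
    parallel edge set satisfies n . v ~ 0."""
    Kinv = np.linalg.inv(K)
    r1 = Kinv @ np.array([p1[0], p1[1], 1.0])
    r2 = Kinv @ np.array([p2[0], p2[1], 1.0])
    n = np.cross(r1, r2)
    return n / np.linalg.norm(n)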

Vanishing point estimation is composed of two tightly-coupled sub-problems: classification (grouping observed edges into parallel sets) and estimation (finding the best VP for each set). The authors present a hybrid approach to VP detection and estimation which combines the Hough transform (HT) for detection with the accuracy of least squares for estimation. An EM algorithm is formulated to probabilistically model edges and their directional uncertainty. A final verification step rejects spurious directions.
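A toy version of that EM loop, using the great-circle normals from the sketch above (the Gaussian noise model with a fixed sigma is a simplification of the paper's directional uncertainty model):

import numpy as np

def em_vanishing_directions(normals, V0, sigma=0.05, n_iter=20):
    """normals: Nx3 edge normals on the Gaussian sphere; V0: Kx3 initial
    vanishing directions (e.g. coarse Hough peaks)."""
    V = V0 / np.linalg.norm(V0, axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of direction k for edge e, from how close
        # the edge's great circle passes to v_k (n . v = 0 when exact).
        err = normals @ V.T                               # N x K
        w = np.exp(-0.5 * (err / sigma) ** 2)
        w /= w.sum(axis=1, keepdims=True) + 1e-12
        # M-step: v_k minimizes sum_e w_ek (n_e . v)^2 over unit vectors,
        # i.e. the eigenvector of sum_e w_ek n_e n_e^T with smallest eigenvalue.
        for k in range(V.shape[0]):
            M = (normals * w[:, [k]]).T @ normals
            V[k] = np.linalg.eigh(M)[1][:, 0]
    return V

A final verification pass would then discard directions whose total responsibility is too small.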

link: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=854809

Monday, April 6, 2009

Video Compass

Main idea: Extract the relative orientation of the camera with respect to the scene. This approach is based on the properties of vanishing points, which are computed from the intersections of parallel lines in uncalibrated images. It assumes that the majority of lines in man-made environments are aligned with the principal orthogonal directions of the world coordinate frame.

This approach can be summarized as follows:
  1. Line Detection.
  2. Lines are grouped into the dominant vanishing directions, which in man-made environments are aligned with the principal orthogonal axes of the world reference frame.
  3. Computation of the vanishing points
    • Two cases: calibrated and uncalibrated camera. This work assumes the uncalibrated case.
    • The grouping and the estimation of the vanishing points is performed at the same time using an EM approach.
  4. The relative orientation of the camera wrt the scene is computed using the vanishing point constraints.
  • Calibration
    • The relation between image coordinates of a point and its counterpart in the calibrated and uncalibrated cases
λx = RX + T and λx' = KRX + KT
    • x' denotes the pixel coordinates of X and K is the intrinsic parameter matrix of the camera. The unit vectors of the world coordinate frame are e_i = [1,0,0]', e_j = [0,1,0]' and e_k = [0,0,1]'. The vanishing points corresponding to 3D lines parallel to any of these directions are
      • v_i = KRe_i
      • v_j = KRe_j
      • v_k = KRe_k
      • the orthogonality relations between e_i, e_j and e_k readily provide constraints on the calibration matrix K. In particular
e'_i e_j = 0 -> v'_i K^(-T)RR'K^(-1)v_j = v'_i K^(-T)K^(-1)v_j = v'_i S v_j = 0

    • where S is the metric associated with the uncalibrated camera S = K^(-T)K^(-1).
    • When three different vanishing points are detected, they provide three independent constraints on the matrix S:
    • v'_i S v_j = 0, v'_i S v_k = 0, v'_j S v_k = 0
    • The zero-skew constraint, which expresses the fact that the image axes are orthogonal, can be written as [1, 0, 0] S [0, 1, 0]' = 0
    • With these constraints the matrix K can be computed by minimizing ||Bs||², where s stacks the independent entries of S (see the sketch after this list).

  • Relative orientation
    • Since the vanishing directions are projections of the vectors associated with the three orthogonal directions i, j, k, they depend on the rotation only. In particular we can write:
K^(-1)v_i = Re_i,  K^(-1)v_j = Re_j,  K^(-1)v_k = Re_k
    • with each vanishing direction being proportional to a column of the rotation matrix R = [r_1, r_2, r_3]. Choosing the two best vanishing directions and normalizing them properly, the third column can be obtained as r_3 = cross(r_1, r_2), which enforces the orthogonality constraints. There is a four-way ambiguity in R due to the sign ambiguity in r_1 and r_2; the additional solutions can be eliminated by considering relative orientation or structure constraints.
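Putting the calibration and orientation steps together, a minimal numpy sketch, assuming three homogeneous vanishing points and adding a square-pixel constraint (S11 = S22, my extra assumption) so that ||Bs||² has a unique minimizer up to scale:

import numpy as np

def _row(u, v):
    """Row of B encoding u' S v = 0, with s = (S11, S12, S13, S22, S23, S33)."""
    return np.array([u[0]*v[0], u[0]*v[1] + u[1]*v[0], u[0]*v[2] + u[2]*v[0],
                     u[1]*v[1], u[1]*v[2] + u[2]*v[1], u[2]*v[2]])

def K_and_R_from_vps(v_i, v_j, v_k):
    B = np.array([_row(v_i, v_j), _row(v_i, v_k), _row(v_j, v_k),
                  _row([1, 0, 0], [0, 1, 0]),        # zero skew: S12 = 0
                  [1, 0, 0, -1, 0, 0]])              # square pixels: S11 = S22
    s = np.linalg.svd(B)[2][-1]                      # minimizer of ||Bs||^2
    S = np.array([[s[0], s[1], s[2]],
                  [s[1], s[3], s[4]],
                  [s[2], s[4], s[5]]])
    if S[0, 0] < 0:
        S = -S                                       # S = K^-T K^-1 is positive definite
    A = np.linalg.inv(S)                             # A = K K^T
    J = np.fliplr(np.eye(3))
    K = J @ np.linalg.cholesky(J @ A @ J) @ J        # upper-triangular factor of A
    K /= K[2, 2]
    r1 = np.linalg.inv(K) @ np.asarray(v_i, float)
    r2 = np.linalg.inv(K) @ np.asarray(v_j, float)
    r1, r2 = r1 / np.linalg.norm(r1), r2 / np.linalg.norm(r2)
    R = np.column_stack([r1, r2, np.cross(r1, r2)])  # four-way sign ambiguity remains
    return K, R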

Planar Ego-Motion Without Correspondences

Motion estimation as a Hough transform

  • The camera views 3D points P_i, given as vectors relative to the camera's reference frame.
  • The P_i are projected to p_i = P_i/||P_i|| by spherical projection.
  • The camera undergoes a rigid motion in the X-Y plane
    • This motion is given by (R, t), where R = R_z(α) and t is a vector in the X-Y plane
    • t = R_z(θ)·e_1
  • From the camera's new coordinate frame, the world points are given as Q_i = RP_i + t and the world points Q_i map to image points q_i = Q_i/||Q_i ||
  • We infer that Rp_i, q_i and t lie in a common plane (the epipolar constraint). This is expressed as
(R_z(α)p_i × q_i)' R_z(θ) e_1 = 0.

  • Extension to any plane, parameterized by the angles β and γ:
    • The vector Rz(γ)Ry(β)e_3 is orthogonal to all vectors lying in the plane.
    • Any plane of motion can therefore be defined by these rotations.
  • Motions on the plane (β, γ) are simply a coordinate frame rotation from motions in the X-Y plane.
  • If the point pairs (p_i, q_i) satisfy a motion in the plane (β, γ), then the point pairs (p'_i, q'_i), where {p'_i, q'_i} = R_y(−β)R_z(−γ){p_i, q_i}, satisfy the same motion in the X-Y plane.
The epipolar constraint for general planar motions can then be expressed as

(R_z(α)(R_z(γ)R_y(β))' p_i × (R_z(γ)R_y(β))' q_i)' R_z(θ) e_1 = 0
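A compact Hough sketch of the X-Y-plane case: for a fixed α, c = R_z(α)p_i × q_i gives c' R_z(θ)e_1 = c_x cos θ + c_y sin θ, so each pair fixes θ up to π. (For simplicity this sketch takes point pairs as given; the paper avoids explicit correspondences by letting all candidate pairs vote.)

import numpy as np

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def hough_planar_motion(p, q, n_alpha=180, n_theta=180):
    """p, q: Nx3 unit vectors on the image sphere. Vote in (alpha, theta)
    space using (Rz(alpha) p_i x q_i)' Rz(theta) e1 = 0."""
    acc = np.zeros((n_alpha, n_theta), dtype=int)
    alphas = np.linspace(-np.pi, np.pi, n_alpha, endpoint=False)
    for ia, a in enumerate(alphas):
        c = np.cross(p @ Rz(a).T, q)                   # rows: Rz(a)p_i x q_i
        theta = np.arctan2(-c[:, 0], c[:, 1]) % np.pi  # solves c_x cos + c_y sin = 0
        bins = (theta / np.pi * n_theta).astype(int) % n_theta
        np.add.at(acc, (ia, bins), 1)
    ia, it = np.unravel_index(acc.argmax(), acc.shape)
    return alphas[ia], it * np.pi / n_theta            # accumulator peak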

Tuesday, February 24, 2009

Analysis of methods

We can use the approach from the GRASP lab as a base:
  • It uses a geometric approach (epipolar geometry, with the 5- and 3-point algorithms) combined with a preemptive RANSAC to accurately compute the position of the camera.
  • It also reconstructs the scene from 3D-2D correspondences.
  • Based on that, it is able to perform localization and mapping.
    • Loop-closing is performed using a bag-of-words-based approach.
    • GPS data is used to restrict the search area (this is not possible indoors).
    • It also uses epipolar geometry to avoid wrong matches.
To apply this approach in a topological layer we should:
  • Define the nodes in the topological map. These can be represented as:
    • Fingerprints (Tapus 2005)
    • Epitomes (Ni 2008)
    • An unsupervised learning approach that creates clusters based on the information present in the acquired images (Kuipers 2002, Bowling 2006)
  • Define the relation between these nodes:
    • Metric information (distance, angle)
    • Connectivity to the neighbors.
  • Define the actions to be performed by the robot. These actions can be:
    • Explore new space
    • Go to the center of free space
    • Leave room
    • Follow the mid-line
    • Etc
Some works that use a topological approach to perform appearance-based SLAM are the following:
  • Appearance-Based Topological Bayesian Inference for Loop-Closing Detection in a Cross-Country Environment (Chen and Wang 2006).
    • They use PCA to model the environment's appearance.
    • Their distributions are approximated by a series of Gaussian models.
  • FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance
    • It is based on bag-of-words retrieval systems (Nistér 2006).
    • It can identify if a new observation is already included in the map or if it comes from a previously unseen place by using a probabilistic framework (loop-closing).
    • It uses SIFT/SURF detector/descriptor.

  • See the references in the blog.
Proposal:
  • FAB-MAP can be extended by adding metric information provided by epipolar geometry and scene reconstruction.


A visual bag of words method for interactive qualitative localization and mapping

Main idea: The bag-of-words approach is adapted to perform visual localization. It is based on the creation of a dictionary from the words (features) acquired from images. Localization is then performed by comparing the words from a new scene with the ones stored in the dictionary.

Visual localization
  • The goal of the classifier is to infer the room from an image.
  • The classifier can be trained incrementally by a vote scheme.
  • Active learning is performed when the classifier fails and the user has to provide the right label for the place that was not correctly classified.
In more detail (assuming the map is already built; a sketch follows the list below):
    • The features are extracted and the corresponding words are found in the dictionary. These words then vote at a first level for the rooms in which they have been perceived.
    • The quality of the vote is given by the difference between the maximum and the second maximum.
    • The winning category votes at a second level.
    • This process is repeated with the other feature spaces and with new images until the quality of the second level vote reaches a given threshold.
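A toy sketch of this two-level vote (extract_words, the dictionary layout, and the normalization of the quality measure are my own simplifications):

from collections import Counter

def localize(images, dictionaries, quality=0.3):
    """dictionaries: {feature_space: {word: [rooms where it was perceived]}}."""
    second_level = Counter()
    for image in images:
        for space, dictionary in dictionaries.items():
            votes = Counter()                         # first level: words vote for rooms
            for word in extract_words(image, space):  # hypothetical helper
                for room in dictionary.get(word, []):
                    votes[room] += 1
            ranked = votes.most_common(2)
            if ranked:
                margin = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
                second_level[ranked[0][0]] += margin  # winner votes at second level
        if not second_level:
            continue
        top = second_level.most_common(2)
        lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
        if lead / (sum(second_level.values()) or 1) > quality:
            return top[0][0]                          # confident: report the room
    return None                                      # undecided: ask the user (active learning)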
Mapping is performed in two processes:
  • Building the dictionary
  • Gathering data for the classifier
  • User-aided approach
  • Memorize in which category a given word has been perceived
Features used
  • SIFT
  • local color histograms
  • local normalized grey level histograms