Visit to GRASP Lab: February 2009

Tuesday, February 24, 2009

Analysis of methods

We can use as base the approach at GRASP lab:

It uses a geometric approach (epipolar geometry) 5 and 3-point algorithms combined with a Preemptive RANSAC to compute accurately the position of the camera.
It also reconstruct the scene from 3D-2D correspondences.
Based on that is able to perform localization and mapping.

Loop-closing is performed using a bag-of-words-based approach.

GPS data is used to restrict the search area. (This is not possible in indoors).

It also uses epipolar geometry to avoid wrong matches.

To apply this approach in a topological layer we should:

Define the nodes in the topological map. These can be represented as:

Fingerprints (Tapus 2005)
Epitomes (Ni 2008)
unsupervised learning approach used to create clusters based on the information present in the acquired images. (Kuipers 2002, Bowling 2006)

Define the relation between these nodes:

Metric information (distance, angle)
Conectivity to the neighbors.

Define the actions to be performed by the robot. These actions can be:

Explore new space
Go to the center of free space
Leave room
Follow the mid-line
Etc

Some works that use a topological approach to perform appearance-based SLAM are the following:

Appearance-Based Topological Bayesian Inference for Loop-Closing Detection in a Cross-Country Environment (Chen and Wang 2006).

They use PCA to model the environment's appearance.
Their distributions are approximated by a series of Gaussian models.

FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance

It is based on the bag-of-words retrieval systems (Níster06).
It can identify if a new observation is already included in the map or if it comes from a previously unseen place by using a probabilistic framework (loop-closing).
It uses SIFT/SURF detector/descriptor.

See the references in the blog.

Proposal:

FAB-MAP can be extended adding metrical information provided by epipolar geometry and scene reconstruction.

Appearance-Based Topological Bayesian Inference for Loop-Closing Detection in a Cross-Country Environment

A visual bag of words method for interactive qualitative localization and mapping

Main idea: The bag-of-words approach is adapted to perform visual localization. It is based in the creation of a dictionary from the words (features) acquired from images. Then the localization is performed by comparing the words from a new scene with the ones stored in the dictionary.

Visual localization

The goal of the classifier is to infer the room from an image.
The classifier can be trained incrementally by a vote scheme.
Active learning is performed when then classifier fails and the user has to provide the right label for the place that was not well classified.

more detailed (assume the map is already built)

The features are extracted and the corresponding words are found in the dictionary. These words then vote at a first level for the rooms in which they have been perceived.
The vote result is calculated by the difference between the maximum and the second maximum.
The winning category votes at a second level.
This process is repeated with the other feature spaces and with new images until the quality of the second level vote reaches a given threshold.

Mapping is performed in two processes

Building the dictionary
Gathering data for the classifier
User-aided approach
Memorize in which category a given word has been perceived

Features used

SIFT
local color histograms
local normalized grey level histograms

Monday, February 23, 2009

Epitomic Location Recognition

Main idea: It uses a generative model based on epitomic image analysis. This analysis is based in a probabilistic approach.

The appearance and geometric structure of the environment is captured into this representation.
It has the ability to model translation and scale invariance together with the fusion of diverse visual features yield enhanced generalization with economical training.
The recognition of a location class is achieved by convolving the query image and the learned epitome.
It doesnt estimate the accurately the camera position.
Occlusions, reflections or non-rigid motions are modeled as noise whose variance changes for different regions within the environment.
These epitomes are generative, probabilistic models and various sources of uncertainty are captured in the variance maps.
In this model, an image is extracted from a larger latent image, the epitome, at a location given by a discret mapping.
Every N x M image I is generated from a Ne x Me location epitome e.
In this approach Ne x Me translations and 3 scales are considered.

Inference and learning

Every image is independient and identically distributed given the epitome.
The goal is to find a single epitome e* which maximizes the probability of the observations. This is achivied by EM algorithm.

Visual features

raw RGB pixels
gist features
disparity maps
local histograms

It uses a stereo camera.

Experiments On Visual Loop Closing Using Vocabulary Trees

Main idea: Several vocabulary trees as GPS data are used to deal with the problem of visual loop closing for long trajectories in an urban environment. The loop closing is performed by searching in the tree the most similar frame to the current one. Epipolar geometry is used to prune out the bad matches.

To deal with the problem of weak definitions of loop-closing a latency test is performed. This consist on testing only images that have not been taken in the last n seconds but that have been taken before that from a viewpoint in a threshold radius. Only the matched images that hold the geometric constraint are considered.

Two types of trees are tested in this work:

Hierarchical K-means
Extremely Randomized Forests (ERT)

Summarizing this approach:

A vocabulary tree is built off-line
Each image of the sequence is added to the vocabulary tree using inverted files.
When the vehicle is close to a previously seen scene, the descriptors of the current image and the inverted files are used to obtain the N closest images.
Epipolar geometry is used to determine if the matches are good.

Monocular Visual Odometry in Urban Environments Using an Omnidirectional Camera

Main idea: Performing visual SLAM using Preemptive RANSAC with the 5-point and 2-point algorithm to compute the position accurately. It uses SIFT correspondeces to compute the epipolar geometry from which the translation and rotation are obtained in a decoupled way.

Landmark detection and tracking. This approach use detector descriptor and matching for SIFT points with a slight modification on some of its thresholds. Matching is divided into two steps computing the epipolar geometry with different distances (20 and 2 pixels ) to the epipolar line.

After the matching between two frames is computed a triangulation of the landmarks is performed to get and initial reconstruction of the scene.

The trajectory estimation relies on the pose estimation process. It has as input a set of 3D-2D correspondences and the correspondeces between the two frames. In this approach the orientation estimation is decoupled from the position estimation.

Epipolar geometry provides the relative position of the new camera up to scale. Only the direction of the translation is estimated. Only one 3D point is required to recover that scale. In doing so, the estimated camera position is consistent with both 3D points and epipolar geometry. Two 3D-2D correspondences are used to estimate the full camera position while fixing its orientation (not only the scale). Preemptive RANSAC is performed followed by iterative refinement.

Triangulation. In this step the orientation is estimated by taking a small set of high quality points. If only 2 landmarks are available close-form, two view algorithm is performed. Otherwise a DLT algorithm for n-views is performed.

FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance

Main idea: SLAM in the space appearance using a probabilistic approach. It is based on the bag-of-words retrieval systems (Níster06). It learns a generative model of the bag-of-words data, i.e. It can identify if a new observation is already included in the map or if it comes from a previously unseen place by using a probabilistic framework.

This approach uses tools as:

Chow Liu trees which aproximates large discrete distributions. This tree is computed off-line.
A semi-external spanning tree algorithm is used to deal with the size of the Chow Liu tree

Probabilistic Navigation using Appearance

The world is modeled as a set of discrete locations.
Each location is described by a distribution over appearance words.
Incoming sensory data is converted into a bag-of-words representation and compared with the stored information.
There is an expresion for the probability that this observation comes either from the location's distributon or from a place which is not in the map.
This approach uses binary features based on SURF detector/descriptor.

Approximations and Parameters

Sets of observations are conditionally independient given position.
Detector behavior is idependient of position.
Location models are generated independently by the environment.
Observations of one feature don't inform about the existence of other features.

Friday, February 20, 2009

Topological SLAM (Tapus, Siegwart, 2005)

Main idea: This approach proposes the concept of fingerprints of places to localize the robot and to build a topological map of the environment with some geometric information. It uses Partially Observable Markov Decision Processes (POMDP).

Fingerprint of places: is a circular list of features, where the ordering of the set matches the relative ordering of the features around the robot.

Uncertainty of features. The probability of a feature being present in the environment when robot perceives it.

Bayesian approach for localization with fingerprints of places:

Supervised learning. Robot explores several locations and stores the fingerprints of the visited places in a database with the name of the place.
Application. The robot localizes itself in the environment by acquiring a fingerprint and comparing it with the fingerprints stored. It uses a Bayesian fingerprint matching.

Topological mapping:

The robot first creates and then updates the global topological map.
A new node is introduced into the topological map whenever the dissimilarity of the newly perceived fingerprint is larger than a threshold.
Each node is composed of set of similar fingerprints of places using a mean fingerprint.

Topological Navigation

A Partially Observable Markov Decision Process (POMDP) model is used.
POMDP integrates both the robot's motion and exetroceptive sensor report data to estimate the pose distribution.

POMDP is defined as

S = set of environment states
A = set of actions
T(s,a,s') = transition function between environment states based on the performed action
O = possible observations.

Indoor Control Strategy

The entropy of the probability distribution over the states of the topological map is used.
When the robot is confident, the action that is optimal to that state is executed.
When the robot is not confident about its state, the robot uses

mid-line following if the previous action was mid-line following

leave the room if the prevoius action was go to the center of the free space

Map Update

The strategy for updating the map is:

When the entropy of the belief state is low enough, the map will be updatedand so the fingerprint and the uncertainty of the features will also be updated.
If the entropy is above the threshold, then updating will not be allowed,and the robot will try to reduce the entropy by continuing navigation with localization.

Closing the loop

Non-explicit loop-closing algorithm. Based on the information provided by the POMDP when two distribution probabilities are observed a loop is identified.

A Distributed Cognitive Map for Spatial Navigation Based on Graphically Organized Place Agents

Main idea: To build a cognitive map from representations given by independient Place Agents. There no exist a global map but local representations given by Place Agents. The union of all local representations give the global map.

Vision system: Stereo camera.

Event representation:

1D - data representing the orientation or movement direction of the robot.
2D - log-polar grid where each bin represents distances to the robot. The data represented there are the distance and the angle. The closest bins represent a more precise space measurement (5cm) while the ones in the periphery area represent bigger distances (5m).
3D - this representation is a combination of 2D and 1D representations. So it represents 1D orientation and 2D position.

Depending on the type of the sensor the event reported by it.

Obstruction sensors = Laser-Range-Finder, Infra-Red-Transceivers
Object-Detection- and Object-Classification Sensors = Vision
Heading-Direction Sensors = Compass, Gyroscope
Ego Motion = Wheel Integration, Optic Flow
Abstract events = Neighbors, Explored Space

For each sensor they use a different local grid. These grids are stored in containers O and can be addressed at once or individually.

Incoming events are separated and multiple slices of representations are kept, separated by time, distance traveled and angle. To have a full description of the environment all these slices are merged in a place-signature when a comparison with an already visited place is required.

The comparison between two containers O1 and O2 is performed by convolution, i.e. O2 is rotated and translated and compared to O1 until the offset between the two containers is minimum. This comparison is performed with the three types of events representations.

Functionality of this approach (a few examples):

add(single slice)
add(other O)
enterNewSensorData(id,data)
getPlaceSignature()
setDecay(delta_t,delta_d)
distance(other O)
findFreeDrivingAngle()
findBestPlace()
getNeighborList()

Driving control. The robot is able to perform actions such as "target in 2m distance at 30 degrees angle".

Local Obstacle Avoidance. The robot permanently monitors its internal representation of the current local environment and compute a detour trajectory if possible trajectories found by a breadth-first search on a grid of hypothetical free positions in increasing steps from robot's the current position.

Path Integration. The robot continually updates a 3-dimensional path-integration vector representing distance, direction and orientation traveled. This vector can be modified by the Place Agents (PA). The robot provides this information to the currently active PA, which might use it to estimate distances to its neighbors.

Note: It uses artificial visual landmarks.

Visit to GRASP Lab