An Introduction to Simultaneous Localisation and Mapping
Simultaneous Localisation and Mapping (SLAM) is becoming an increasingly important topic within the computer vision community, and is receiving particular interest from the augmented and virtual reality industries. With a variety of SLAM systems being made available, from both academia and industry, it is worth exploring what exactly is meant by SLAM. This article will give a brief introduction to what SLAM is (and what it isn't), what it's for, and why it's important, in the context of computer vision research and development, and augmented reality in particular.
What is SLAM?
'SLAM' is not a particular algorithm or piece of software, but rather it refers to the problem of trying to simultaneously localise (i.e. find the position/orientation of) some sensor with respect to its surroundings, while at the same time mapping the structure of that environment. This can be done in a number of different ways, depending on the situation; so when I refer to a 'SLAM system', take that to mean 'a set of algorithms working to solve the simultaneous localisation and mapping problem'.
SLAM is not exclusively a computer vision problem, and need not involve visual information at all: in fact, much of the early research on SLAM involved ground-based robots equipped with laser scanners. However, for this article I will mostly be discussing visual SLAM - where the primary mode of sensing is via a camera - since it is of most interest in the context of augmented reality, though many of the themes discussed apply more generally.
The requirement of recovering both the camera's position and the map, when neither are known to begin with, distinguishes the SLAM problem from other tasks. For example, marker-based tracking is not SLAM, because the marker image (analogous to the map) is known beforehand. 3D reconstruction with a fixed camera rig is not SLAM either, because while the map (here the model of the object) is being recovered, the positions of the cameras are already known. The challenge in SLAM is to recover both camera pose and map structure while initially knowing neither.
An important distinction between SLAM and other apparently similar methods of recovering pose and structure is the requirement that it must operate in real time. This is a somewhat slippery concept, but in general it means that the processing done on each incoming camera image must be finished by the time the next one arrives, so that the pose of the camera is available immediately and not as the result of a post-processing stage. This distinguishes SLAM from techniques like structure from motion, where a set of unordered images is processed offline to recover the 3D structure of an environment, in a potentially (extremely) time-consuming process. Such offline reconstruction can achieve impressive results, but crucially can't tell you where the camera is during acquisition.
A Brief History of SLAM
Research on the SLAM problem began within the robotics community (arguably with the 1986 work of Smith and Cheeseman), usually with wheeled robots traversing a flat ground plane. Typically this was done by combining sensor readings (such as from a laser scanner or sonar) with information about the control input (e.g. steering angle) and the measured robot state (e.g. counting wheel rotations). This may seem far removed from tracking a handheld camera freely moving in space, but it embodies many of the core difficulties of SLAM, such as creating a consistent and accurate map, and making best use of multiple unreliable sources of information.
More recently the use of visual sensors has become an important aspect of SLAM research, in part because an image provides a rich source of information about the structure of the environment (containing more information than a sonar ping for example). A large amount of research on visual SLAM used stereo cameras, or cameras alongside other sensors (such as accelerometers or GPS), but since around 2001 a number of works showed how SLAM could be done successfully using only a single camera (known as monocular visual SLAM), for example the seminal work of Andrew Davison at the University of Oxford.
This was crucial in making SLAM a more widely useful technology, since devices equipped with just a single camera - such as webcams and mobile phones - are vastly more common and accessible than specialist sensing equipment. More recent work has demonstrated how monocular visual SLAM can be used to create large-scale maps, how maps can be automatically enhanced with meaningful 3D structure, and how extremely detailed shapes can be recovered in real time. SLAM remains an active field of research within computer vision, and new and improved techniques are constantly emerging.
How SLAM Works
The majority of modern visual SLAM systems are based on tracking a set of points through successive camera frames, using these tracks to triangulate the points' 3D positions, while simultaneously using the estimated point locations to calculate the camera pose which could have observed them. It may seem impossible to compute one without the other - indeed from only a few points it is - but by observing a sufficient number of points it is possible to solve for both structure and motion. Even with a single camera, by carefully combining measurements of points made over multiple frames it is possible to recover pose and structure with high accuracy, up to an unknown global scale factor. The following example shows the forthcoming Kudan SLAM System tracking a set of points (left), enabling the position and orientation of the camera to be estimated with respect to the map (right):
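To make the triangulation step concrete, here is a minimal sketch (illustrative only, not Kudan's implementation) of linear triangulation: given two known camera projection matrices and a point's tracked 2D location in each view, its 3D position is recovered with the direct linear transform (DLT):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one point seen in two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: 2D observations."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the null vector of A (smallest singular value).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two known camera poses observing the 3D point (2, 1, 5):
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # reference pose
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])   # translated camera
X_true = np.array([2.0, 1.0, 5.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]          # tracked locations
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))  # ≈ [2. 1. 5.]
```

In a real system the two poses are themselves estimates, and the observations are noisy, which is why the triangulated points are subsequently refined by the optimisation described below.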
At its heart, SLAM is an optimisation problem, where the goal is to compute the best configuration of camera poses and point positions in order to minimise reprojection error (the difference between a point's tracked location and where it is expected to be given the camera pose estimate, over all points). The method of choice to solve this problem is called bundle adjustment, a nonlinear least squares algorithm which, given a suitable starting configuration, iteratively approaches the minimum error for the whole system.
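As an illustration, a minimal pinhole-camera sketch of the reprojection error that bundle adjustment minimises might look like this (function and variable names are illustrative, not from any particular system):

```python
import numpy as np

def reproject(K, R, t, X):
    """Project 3D point X into the image of a pinhole camera with
    intrinsics K and pose (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def total_reprojection_error(K, poses, points, observations):
    """Sum of squared differences between where each tracked point was
    observed and where the current estimate says it should appear.
    observations: iterable of (camera_index, point_index, observed_2d)."""
    err = 0.0
    for ci, pi, obs in observations:
        R, t = poses[ci]
        err += np.sum((reproject(K, R, t, points[pi]) - obs) ** 2)
    return err

# With a perfect estimate the error is zero; bundle adjustment searches for
# the poses and points that drive this quantity to its minimum.
K = np.diag([500.0, 500.0, 1.0])
pose = (np.eye(3), np.zeros(3))
point = np.array([0.5, -0.2, 3.0])
obs = reproject(K, *pose, point)
print(total_reprojection_error(K, [pose], [point], [(0, 0, obs)]))  # 0.0
```

A nonlinear least-squares solver (for example `scipy.optimize.least_squares`) iteratively adjusts all poses and points to reduce this error, which is exactly the role bundle adjustment plays.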
The problem with bundle adjustment is that it can be very time consuming to find the best solution, and the time taken grows quickly as the size of the map increases. On modern multi-core machines this is addressed by separating localisation from mapping. This means localisation - the tracking of points in order to estimate the current camera pose - can happen in real time on one thread, while the mapping thread runs bundle adjustment on the map in the background. On completion the mapping thread updates the map used for tracking, and in turn the tracker adds new observations to expand the map. This can be seen in the above video, where tracking is continuous but the map is only optimised periodically.
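Structurally, the two-thread split can be sketched like this (the per-frame tracking work and the bundle adjustment are stand-in placeholders; a real system does far more under each lock):

```python
import threading, queue, time

class SharedMap:
    """Map shared between the tracking and mapping threads."""
    def __init__(self):
        self.lock = threading.Lock()
        self.points = []                      # current optimised map points

obs_queue = queue.Queue()                     # tracker -> mapper hand-off

def track(shared, frames, poses_out):
    """Fast per-frame loop: estimate a pose against the current map."""
    for frame in frames:
        with shared.lock:
            n_points = len(shared.points)     # read a consistent map snapshot
        poses_out.append((frame, n_points))   # stand-in for a real pose estimate
        obs_queue.put(frame)                  # hand new observations to the mapper

def map_builder(shared, n_frames):
    """Slow background loop: refine and grow the map."""
    for _ in range(n_frames):
        frame = obs_queue.get()
        time.sleep(0.01)                      # bundle adjustment is expensive
        with shared.lock:
            shared.points.append(frame)       # publish the updated map

shared, poses = SharedMap(), []
t_track = threading.Thread(target=track, args=(shared, range(20), poses))
t_map = threading.Thread(target=map_builder, args=(shared, 20))
t_map.start(); t_track.start()
t_track.join(); t_map.join()
print(len(poses), len(shared.points))  # 20 20
```

The key property is that the tracking loop never waits for the expensive optimisation: it only takes the lock briefly to read (or hand over) data, so pose estimates keep arriving at frame rate while the map improves asynchronously.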
Aside from details such as the method of tracking points, initialisation of the map, robustness to incorrect matches and various strategies for optimising the map more efficiently (which are beyond the scope of this article), the above describes the basic operation of a SLAM system. Beyond this, additional functionality is often incorporated in order to make SLAM more practical in the real world. For example, a crucial feature for larger scale mapping is loop closure, whereby the gradual accumulation of errors over time can be ameliorated by associating the current tracked location to an earlier part of the map, thereby enforcing extra constraints on the optimisation problem to get an overall more correct map.
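To see how a loop-closure constraint corrects accumulated drift, here is a toy one-dimensional pose-graph example (purely illustrative, far simpler than a real system): four odometry measurements are each biased by 0.1, and a single place-recognition measurement ties the final pose directly back to the start. Solving the resulting least-squares problem spreads the accumulated error over the whole trajectory:

```python
import numpy as np

# Toy 1-D pose graph: four forward steps, odometry measures each as 1.1
# (drift: the true step is 1.0), and a loop-closure observation says the
# final pose is exactly 4.0 ahead of the start (pose p0 is fixed at 0).
A = np.array([
    [ 1,  0,  0,  0],   # p1 - p0 = 1.1
    [-1,  1,  0,  0],   # p2 - p1 = 1.1
    [ 0, -1,  1,  0],   # p3 - p2 = 1.1
    [ 0,  0, -1,  1],   # p4 - p3 = 1.1
    [ 0,  0,  0,  1],   # p4 - p0 = 4.0   (loop-closure constraint)
], dtype=float)
b = np.array([1.1, 1.1, 1.1, 1.1, 4.0])

poses, *_ = np.linalg.lstsq(A, b, rcond=None)
print(poses)  # [1.02 2.04 3.06 4.08]
```

Without the final constraint the last pose would drift to 4.4; with it, the 0.4 of accumulated error is distributed evenly, pulling every pose closer to the truth. Real loop closure does the same thing with 6-degree-of-freedom poses and many more constraints.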
Another essential technique, known as relocalisation, copes with temporary tracking failures which could otherwise cause the system to fail completely. It allows the tracker to be restarted by finding which part of the previously visited map looks most like the current camera view. This is demonstrated below using our SLAM system, where tracking fails (due to moving away from the mapped area before it can be expanded) and is recovered as soon as the map is seen again. This also solves the 'kidnapped camera' problem, where the camera view is blocked while being moved to a different location: tracking resumes successfully even after the camera's viewpoint has changed.
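One common way to implement relocalisation (a hypothetical sketch, not necessarily how our system works) is to store an appearance descriptor, such as a normalised bag-of-words histogram, alongside each mapped keyframe, and on tracking failure restart from the pose of the most similar one:

```python
import numpy as np

def relocalise(current, keyframes):
    """Return the pose of the stored keyframe whose appearance descriptor
    is most similar to the current view (dot product of unit descriptors)."""
    scores = [float(np.dot(current, desc)) for desc, _ in keyframes]
    best = int(np.argmax(scores))
    return keyframes[best][1], scores[best]

# Hypothetical 4-bin appearance descriptors, each with an associated pose.
keyframes = [
    (np.array([1.0, 0.0, 0.0, 0.0]), "pose_A"),
    (np.array([0.0, 0.8, 0.6, 0.0]), "pose_B"),
    (np.array([0.0, 0.0, 0.6, 0.8]), "pose_C"),
]
current = np.array([0.0, 0.7, 0.7, 0.1])   # looks most like keyframe B
pose, score = relocalise(current, keyframes)
print(pose)  # pose_B
```

In practice the best match is then verified geometrically (by checking that map points actually reproject correctly from the candidate pose) before tracking is resumed, so that a merely similar-looking view does not cause a false restart.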
The above gives a quick overview of what the SLAM problem involves and how one would go about solving it. But why is this important and how is a SLAM system useful? As per the original robotics research, localising a camera within an unknown environment is useful when exploring and navigating terrain for which no prior map (or GPS signal) is available - it is even being used to explore the surface of Mars.
This ability to accurately localise a camera without a prior reference is also crucial to its use in augmented reality. It is how virtual content can be locked in place with respect to the real world: the map is built from the real world itself, and the position of the camera capturing the images to be augmented is estimated continuously. As such, fast and accurate localisation is crucial (otherwise lag or drift in the rendered graphics would be apparent). The difference between SLAM and something like our marker tracker is that no pre-specified image target is needed to create augmentations; nor is it necessary for any particular objects to remain in view.
More sophisticated augmented reality applications are possible because, as a necessary step in performing localisation, a map is being built. This enables applications beyond simply adding virtual content into the world coordinate frame: virtual content can be aware of, and react to, real objects present in the scene.
Rather than building up a map of a surrounding environment, a SLAM system can equally well be used to build a 3D reconstruction of an object, simply by moving a camera around it. For example, the forthcoming Kudan SLAM System is used here to show the process of creating a point-based 3D model of a complicated object, then to track it while deliberately ignoring the background:
It is important to note that while SLAM enables a wide range of applications, there are some things for which it is not the right tool. SLAM systems work on the assumption that the camera moves through an unchanging scene, which makes them unsuitable for tasks such as person tracking and gesture recognition: both of these involve non-rigidly deforming objects and a non-static map, and while these tasks can indeed be tackled by computer vision, they are not 'mapping' problems. Similarly, vision tasks such as face recognition, image understanding and classification are not related to SLAM.
Nevertheless, as shown above, a number of important tasks such as tracking moving cameras, augmented reality, map reconstruction, interactions between real and virtual objects, object tracking and 3D modelling can all be accomplished using a SLAM system, and the availability of such technology will lead to further developments and increased sophistication in augmented reality applications.