{"id":477,"date":"2020-08-20T13:37:45","date_gmt":"2020-08-20T13:37:45","guid":{"rendered":"http:\/\/kudan.io\/jp\/?p=477"},"modified":"2020-11-17T13:53:34","modified_gmt":"2020-11-17T13:53:34","slug":"visual-slam-the-basics","status":"publish","type":"post","link":"https:\/\/www.kudan.io\/jp\/archives\/477","title":{"rendered":"Visual SLAM: The Basics"},"content":{"rendered":"<p>\u200bSLAM\u306b\u306f\u591a\u304f\u306e\u5f62\u5f0f\u3084\u624b\u6cd5\u304c\u3042\u308a\u307e\u3059\u304c\u3001\u307e\u305a\u306f\u6a5f\u80fd\u30d9\u30fc\u30b9\u306evisual SLAM\u304b\u3089\u59cb\u3081\u307e\u3057\u3087\u3046\u3002Direct Visual SLAM\u3084\u3001\u6df1\u5ea6\u30bb\u30f3\u30b5\u4ed8\u304d\u30ab\u30e1\u30e9\u3092\u4f7f\u7528\u3059\u308bSLAM\u3001LiDAR\u306a\u3069\u306e\u4ed6\u306eSLAM\u624b\u6cd5\u306b\u3064\u3044\u3066\u306f\u3001\u4ee5\u964d\u306e\u8a18\u4e8b\u3067\u8aac\u660e\u3057\u307e\u3059\u3002<\/p>\n<p>\u307e\u305a\u306f\u3001Visual SLAM\u306b\u3064\u3044\u3066\u3001\u6982\u8aac\u3057\u307e\u3059\u3002\uff08\u4ee5\u4e0b\u3001\u82f1\u6587\u306e\u307f\uff09<\/p>\n<hr \/>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-491 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/Rainbow-Shopping-Point-Cloud-Short.gif\" alt=\"\" width=\"640\" height=\"347\" \/><\/p>\n<p>In my <a href=\"http:\/\/kudan.io\/jp\/archives\/413\" target=\"_blank\" rel=\"noopener noreferrer\">last article<\/a>, we looked at SLAM from a 16km (50,000 feet) perspective, so let\u2019s look at it from 2m. Not close enough to get your hands dirty, but enough to get a good look over someone\u2019s shoulders. SLAM can take on many forms and approaches, but for our purpose, let\u2019s start with feature-based visual SLAM. I will cover other SLAM approaches such as direct visual SLAM, and those that use cameras with depth sensors, and LiDAR in subsequent articles.<\/p>\n<p>As the name implies, visual SLAM utilizes camera(s) as the primary source of sensor input to sense the surrounding environment. This can be done either with a single camera, multiple cameras, and with or without an inertial measurement unit (IMU) that measure translational and rotational movements.<\/p>\n<p>Let\u2019s walk through the process chart and see what happens at each stage.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-478 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/01-Feature-based-VSLAM-Process.png\" alt=\"\" width=\"700\" height=\"72\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/01-Feature-based-VSLAM-Process.png 700w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/01-Feature-based-VSLAM-Process-300x31.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/p>\n<h3><strong>Sensor\/Camera Measurements: Setup<\/strong><\/h3>\n<p>In order to make this a bit more concrete, let\u2019s imagine a pair of augmented reality glasses. For simplicity, the glasses have two cameras mounted at the temple and an IMU centered between the cameras. The two cameras provide the stereo vision to make depth estimations easier, and the IMU will help provide better movement estimations.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-479 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/02-Glasses-2.png\" alt=\"\" width=\"254\" height=\"127\" \/><\/p>\n<p>There are a couple specifications for the camera that help with SLAM: global shutter and grayscale sensor. Also, the tracking cameras don\u2019t have to be super high-resolution, and typically a VGA (640x480 pixel) camera is sufficient (more pixels, more processing). Let\u2019s also assume there is a 6-axis IMU: motion on the x, y and z-axis, and pitch, yaw and roll. Finally, these sensors should be synchronized against a common clock to match the sensor outputs against each other.<\/p>\n<h3><strong>Let's start solving the puzzle<\/strong><\/h3>\n<p>I find the process of completing a jigsaw puzzle as a good analogy for some major components of the SLAM process. The processes described below are mostly conceptual and simplified to help in understanding the overall mechanisms involved.<\/p>\n<p>When the system is initialized, and the cameras are turned on, you are given your first piece of the puzzle. You don\u2019t know how many pieces there are, and you don\u2019t know what part of the puzzle you are looking at. You have your first stereo images and IMU readings.<\/p>\n<h3><strong>Feature Extraction: Distortion correction<\/strong><\/h3>\n<p>Most camera lenses will introduce some level of distortion to the captured images. There will be distortion from the design of the lenses, as well as distortion in each lens from minute differences during manufacturing. We can \u201cundistort\u201d the image through a distortion grid that transforms the image close to its original representation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-480 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/03-Distortion-Correction-2.png\" alt=\"\" width=\"596\" height=\"500\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/03-Distortion-Correction-2.png 596w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/03-Distortion-Correction-2-300x252.png 300w\" sizes=\"auto, (max-width: 596px) 100vw, 596px\" \/><\/p>\n<h3><strong>Feature Extraction: Feature points<\/strong><\/h3>\n<p>Features in computer vision can take on a number of forms, and don\u2019t necessarily correspond to what humans think of as features. Features typically take the form of corners, or blobs, a collection of pixels that uniquely stand out, and should be able to be consistently identified from an image, and occasionally edges. The figure below depicts the features detected (left), how they would be represented in a map (right).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-481 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/04-Feature-Points-3.png\" alt=\"\" width=\"960\" height=\"338\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/04-Feature-Points-3.png 960w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/04-Feature-Points-3-300x106.png 300w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/04-Feature-Points-3-768x270.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/p>\n<h3><strong>Feature Extraction: Feature matching and depth estimation<\/strong><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-482 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/05-Stereo-Features.png\" alt=\"\" width=\"622\" height=\"303\" \/><\/p>\n<p>Given these stereo images, we should be able to see overlapping features between the images. These identical features can then be used to estimate the distance from the sensor. We know the orientation of the cameras and the distance between them. We use this information to perform image rectification - the mapping pixels between the two images against a common plane. This is then used to determine the disparity of the common features between the two images. Disparity and distance are inversely related, such that as the distance from the camera increases, the disparity decreases.<\/p>\n<p>Now, we can estimate the depth of each of the features using triangulation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-483 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/06-Triangulation.png\" alt=\"\" width=\"265\" height=\"395\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/06-Triangulation.png 265w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/06-Triangulation-201x300.png 201w\" sizes=\"auto, (max-width: 265px) 100vw, 265px\" \/><\/p>\n<p>In a single camera scenario, we cannot infer depth from a single image, but as the camera moves around, depth can be inferred through parallax by comparing the features in subsequent images.<\/p>\n<h3><strong>Data association<\/strong><\/h3>\n<p>The data association step takes the features detected, along with its estimated location in space, and builds a map of these features with regards to the cameras. As this process continues through subsequent frames, the system continually takes new measurements, and associates features to known elements of the map, and prunes uncertain features.<\/p>\n<p>As we track the motion of the camera, we can start making predictions based on the known features, and how they should change based on the motion.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-484 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/07-Observations.png\" alt=\"\" width=\"465\" height=\"310\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/07-Observations.png 465w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/07-Observations-300x200.png 300w\" sizes=\"auto, (max-width: 465px) 100vw, 465px\" \/><\/p>\n<p>The constraint of computing resources and time (especially real-time requirements) creates a forcing function for SLAM, where the process becomes a tradeoff between map accuracy, and processing resource and time.\u00a0As the measurements of features, and location\/pose increase over time, representation of the observed environment has to be constrained and optimized.\u00a0We\u2019ll take a look at some of these tools, and different approaches to optimizing the model.<\/p>\n<h3><strong>Location, Pose and Map Update: Kalman filters<\/strong><\/h3>\n<p>As the camera moves through space, there is increasing noise and uncertainty between the images the camera captures and its associated motion.\u00a0Kalman filters reduce the effects of noise and uncertainty among different measurements to model a linear system more accurately by continually making predictions, updating and refining the model against the observed measurements.\u00a0For SLAM systems, we typically use extended Kalman filters (EKF), which takes nonlinear systems, and linearizes the predictions and measurements around their mean.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-485 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/08-SLAM-with-Filters-Process.png\" alt=\"\" width=\"907\" height=\"213\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/08-SLAM-with-Filters-Process.png 907w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/08-SLAM-with-Filters-Process-300x70.png 300w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/08-SLAM-with-Filters-Process-768x180.png 768w\" sizes=\"auto, (max-width: 907px) 100vw, 907px\" \/><\/p>\n<p>Utilizing a probabilistic approach, Kalman filters take into account all the previous measurements and associate the features to the latest camera pose through the use of a state vector and a covariance matrix for each feature against one another.\u00a0However, all noise and states are assumed to be Gaussian.\u00a0As you can imagine as the tracked points grow, the computation becomes quite expensive, and harder to scale.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-486 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/09-After-Kalman-Filter.png\" alt=\"\" width=\"487\" height=\"373\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/09-After-Kalman-Filter.png 487w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/09-After-Kalman-Filter-300x230.png 300w\" sizes=\"auto, (max-width: 487px) 100vw, 487px\" \/><\/p>\n<h3><strong>Location, Pose and Map Update: Particle filters<\/strong><\/h3>\n<p>In contrast to Kalman filters, particle filters treat each feature point as a particle in space with some level of positional uncertainty. At each measurement this uncertainty is updated (normalized and re-weighted) against the predicted position with regard to the camera movement. Unlike Kalman filters, particle filters can handle noise from any distribution, and states can have a multi-modal distribution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-487 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/10-After-Particle-Filter.png\" alt=\"\" width=\"517\" height=\"386\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/10-After-Particle-Filter.png 517w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/10-After-Particle-Filter-300x224.png 300w\" sizes=\"auto, (max-width: 517px) 100vw, 517px\" \/><\/p>\n<h3><strong>Location, Pose and Map Update: Bundle adjustment<\/strong><\/h3>\n<p>As the number of points being tracked in space along with corresponding camera poses increase, bundle adjustment is an optimization step that performs a nonlinear least squares operation on the current model.\u00a0Imagine a \u201cbundle\u201d of light rays from all the features connected to each of the camera observations, and \u201cadjusted\u201d to optimize these connections directly to the sensor position and orientation as in the figure below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-488 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/11-After-Bundle-Adjustment.png\" alt=\"\" width=\"901\" height=\"310\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/11-After-Bundle-Adjustment.png 901w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/11-After-Bundle-Adjustment-300x103.png 300w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/11-After-Bundle-Adjustment-768x264.png 768w\" sizes=\"auto, (max-width: 901px) 100vw, 901px\" \/><\/p>\n<p>Bundle adjustment is a batch operation, and not performed on every captured frame.<\/p>\n<h3><strong>Location, Pose and Map Update: Keyframe<\/strong><\/h3>\n<p>Keyframes are select observations by the camera that capture a \u201cgood\u201d representation of the environment.\u00a0Some approaches will perform a bundle adjustment after every keyframe.\u00a0 Filtering becomes extremely computationally expensive as the map model grows, however keyframes enable more feature points or larger maps, with a balanced tradeoff between accuracy and efficiency.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-489 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/12-After-Keyframe-Selection.png\" alt=\"\" width=\"487\" height=\"355\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/12-After-Keyframe-Selection.png 487w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/12-After-Keyframe-Selection-300x219.png 300w\" sizes=\"auto, (max-width: 487px) 100vw, 487px\" \/><\/p>\n<h3><strong>Post-update<\/strong><\/h3>\n<p>Once\u00a0the update step completes, the 3D map of the current environment is updated, and the position and orientation of the sensor within this map is known.\u00a0There are two important concepts that loosely fit into this final step - a test to see if the system has been here before, and what happens when the system loses tracking or gets lost.<\/p>\n<h3><strong>Post update: Loop closure<\/strong><\/h3>\n<p>As the system continues to move through space and build a model of its environment, the system will continue to accumulate measurement errors and sensor drift, which will be reflected in the map being generated.\u00a0Loop closure occurs when the system recognizes that it is revisiting a previously mapped area, and connects previously unconnected parts of the map into a loop, correcting the accumulated errors in the map.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-490 aligncenter\" src=\"http:\/\/kudan.io\/jp\/wp-content\/uploads\/2020\/11\/13-Loop-Closure.png\" alt=\"\" width=\"960\" height=\"344\" srcset=\"https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/13-Loop-Closure.png 960w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/13-Loop-Closure-300x108.png 300w, https:\/\/www.kudan.io\/jp\/wp-content\/uploads\/2020\/11\/13-Loop-Closure-768x275.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><\/p>\n<h3><strong>Post update: Relocalization<\/strong><\/h3>\n<p>The term localization in SLAM is the awareness of the system\u2019s orientation and position within the given environment and space.\u00a0Relocalization occurs when a system loses tracking (or initialized in a new environment), and needs to assess its location based on currently observable features.\u00a0If the system is able to match the features it observes against the available map, it will localize itself to the corresponding pose in the map, and continue the SLAM process.<\/p>\n<h3><strong>Final words<\/strong><\/h3>\n<p>With the goal of trying not to be too mathy and technical, and yet conceptually descriptive enough to help get a depth of understanding of the processes that take place within one type of SLAM system, this goes a bit beyond my \u201c5 minute read\u201d target, but I think it\u2019s essential to cover these fundamental concepts for visual SLAM to help with future topics.<\/p>\n<p>Let me know your thoughts, comments and questions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u200bSLAM\u306b\u306f\u591a\u304f\u306e\u5f62\u5f0f\u3084\u624b\u6cd5\u304c\u3042\u308a\u307e\u3059\u304c\u3001\u307e\u305a\u306f\u6a5f\u80fd\u30d9\u30fc\u30b9\u306evisual SLAM\u304b\u3089\u59cb\u3081\u307e\u3057\u3087\u3046\u3002Direct Visual SLAM\u3084\u3001\u6df1\u5ea6\u30bb\u30f3\u30b5\u4ed8\u304d\u30ab\u30e1\u30e9\u3092\u4f7f\u7528\u3059\u308bSLAM\u3001LiDAR\u306a\u3069\u306e\u4ed6\u306eSLAM\u624b\u6cd5\u306b\u3064\u3044 [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":242,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"sns_share_botton_hide":"","vkExUnit_sns_title":"","_vk_print_noindex":"","sitemap_hide":"","_veu_custom_css":"","veu_display_promotion_alert":"","vkexunit_cta_each_option":"","_lightning_design_setting":[],"header_top_description":"","footnotes":""},"categories":[3],"tags":[47,16,73,7,13,74,63,30,61],"class_list":["post-477","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","tag-artificial-perception","tag-artisense","tag-computer-vision","tag-kudan","tag-kudanslam","tag-machine-perception","tag-mapping","tag-slam","tag-vslam"],"veu_head_title_object":{"title":"","add_site_title":""},"_links":{"self":[{"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/posts\/477","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/comments?post=477"}],"version-history":[{"count":2,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/posts\/477\/revisions"}],"predecessor-version":[{"id":494,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/posts\/477\/revisions\/494"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/media\/242"}],"wp:attachment":[{"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/media?parent=477"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/categories?post=477"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kudan.io\/jp\/wp-json\/wp\/v2\/tags?post=477"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}