Google’s VPS – how well will it work?
Google has just announced that it is working on indoor positioning using computer vision; they call the result a Visual Positioning System (VPS). At first glance, this could appear to be mobile-friendly sparse SLAM technology. Since Kudan has made this type of SLAM system, we’re going to share some of our thoughts on the challenges Google faces and has potentially solved.
Is it actually SLAM?
Simultaneous localisation and mapping implies there is a mapping component, but is this really the case for VPS? If the goal is to find the device’s absolute position within a building, creating the map as you go isn’t likely to work well. Unless the user starts outside (where GPS is still accurate), an arbitrary starting point only gives you relative movement, and in all cases the result is prone to map drift. Loop closure is a possible solution, but there is no guarantee that the user’s trajectory will create a detectable loop, so it is preferable to minimise map drift in the first place, which is the approach Kudan takes. Realistically, for this to work, maps need to be pre-recorded, and user devices need only concern themselves with the localisation task.
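To give a feel for why drift matters when only relative movement is available, here is a rough dead-reckoning sketch in Python. It is not how VPS or Kudan's tracker works; it simply assumes a hypothetical constant heading bias per step (the 0.5 degree default is an invented illustrative value) and shows how the position error grows with distance travelled when nothing corrects it.

```python
import math

def dead_reckon(steps, step_len=1.0, heading_bias_deg=0.5):
    """Integrate relative motion along a nominally straight path.

    heading_bias_deg is a hypothetical per-step heading error,
    chosen only for illustration.
    """
    x = y = 0.0
    heading = 0.0
    for _ in range(steps):
        heading += math.radians(heading_bias_deg)  # error accumulates
        x += step_len * math.cos(heading)
        y += step_len * math.sin(heading)
    return x, y

def drift_error(steps, step_len=1.0, heading_bias_deg=0.5):
    """Distance between the estimated and the true end point."""
    x, y = dead_reckon(steps, step_len, heading_bias_deg)
    true_x, true_y = steps * step_len, 0.0
    return math.hypot(x - true_x, y - true_y)

if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, "steps ->", round(drift_error(n), 2), "units of error")
```

Without a loop closure or a pre-recorded map to anchor against, nothing in this loop ever pulls the estimate back towards the true position.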
Enter Google Maps
Luckily Google has some prior experience with creating maps. We’ve all seen the interior street views, but what’s required in making an effective map for VPS?
What the mapping process might look like: Kudan’s SLAM building a map from a drone.
Good relocalisation capabilities are critical since users might want to start their tracking journey from anywhere, so the map must provide sufficient data for the device to localise its position from a single camera frame. For a typical feature descriptor approach to relocalisation, the mapping phase will be required to record the environment from a series of different viewpoints, making multiple passes through each reachable part of the environment, aiming the sensors in different directions. Even though descriptors are space efficient, providing all of this data to a user’s device to let it localise itself is likely to be a bad idea. Google is much more likely to process relocalisation on its servers.
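As a minimal sketch of what a feature descriptor lookup can involve, the Python below matches a binary descriptor against a stored set using Hamming distance and rejects matches that are too distant or too ambiguous. The function names, the integer encoding of descriptors, and the `max_distance` and `ratio` thresholds are all assumptions made for illustration; nothing here is taken from Google's or Kudan's implementation.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (as ints)."""
    return bin(a ^ b).count("1")

def match_descriptor(query, map_descriptors, max_distance=2, ratio=0.7):
    """Return the index of the best map descriptor, or None.

    A match is rejected when it is too far away, or when the runner-up
    is almost as close (a ratio test), since that suggests the feature
    is not unique within the map. Thresholds are illustrative only.
    """
    if not map_descriptors:
        return None
    ranked = sorted((hamming(query, d), i) for i, d in enumerate(map_descriptors))
    best_dist, best_idx = ranked[0]
    if best_dist > max_distance:
        return None
    if len(ranked) > 1:
        second_dist = ranked[1][0]
        if best_dist >= ratio * second_dist:
            return None  # ambiguous: another map point looks the same
    return best_idx

if __name__ == "__main__":
    stored = [0b11110000, 0b00001111, 0b10101010]
    print(match_descriptor(0b11110001, stored))
```

A real system would compare many descriptors per frame and then verify the candidate pose geometrically, but the linear scan above already hints at why shipping the whole descriptor database to the device and searching it there is unattractive.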
Once relocalised, the map’s purpose is to provide information for tracking. Prior knowledge of the device’s location makes tracking an easier task than relocalisation, but to maintain typical real-time camera framerates, tracking needs to be performed on the device itself. Approaches to mobile tracking typically involve storing image data for keyframes, so to keep data transfer overhead to a minimum, it helps if keyframes are sparse. Approaches such as the greedy keyframe creation found in ORB-SLAM are likely to be ineffective in this setting, and there is significant benefit to the tracker being viewpoint invariant with respect to the map. Kudan’s tracker is especially good at tracking with a sparse map, maintaining accuracy without the need to create new keyframes.
Kudan mapping and tracking a meeting room. Map expansion stops at around the 15 second mark.
Suitable for everywhere?
Not every indoor environment will be suitable for VPS and there will be several issues to contend with.
As with all visual systems, there needs to be sufficient visual interest to work reliably. This is especially important for relocalisation, where uniqueness of certain regions of the environment is necessary to avoid false-positive detections. This is further amplified when only smaller areas of the environment can fit within view of the camera due to a lower field of view, or physical constraints forcing the user close to various parts. Indoor establishments won’t want to change their layouts and repetition is a common theme in design.
In real-world SLAM use-cases you rarely get to view the environment unobstructed, especially in public places where there is increased likelihood of other people walking through the frame. Robustness to occlusion is very important to avoid this problem. While occlusion is typically defined as things blocking the view, for a SLAM system it’s any time it expected to be able to see a point but was unable to. This could happen for multiple reasons, including a non-rigid map, which is probable in a lot of shopping locations as products get moved around and misaligned. Changing lighting conditions can also present as occlusion, where certain areas get whited out, or sit in dark shadows.
In testing of the Kudan tracker we’ve found we can deal with upwards of 80% point occlusion in some cases. While accuracy does suffer a little bit, it’s enough to maintain tracking until the occlusion can be reduced.
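The definition of occlusion used here (points the tracker expected to see but could not) can be sketched in a few lines of Python. The function names, the minimum point count and the use of 80% as a hard cut-off are assumptions for illustration only; the figure above is an observation from testing, not a threshold the tracker is known to apply.

```python
def occlusion_fraction(expected_points, matched_points):
    """Fraction of map points that should be visible but were not found."""
    if expected_points <= 0:
        raise ValueError("expected_points must be positive")
    if not 0 <= matched_points <= expected_points:
        raise ValueError("matched_points must be within 0..expected_points")
    return (expected_points - matched_points) / expected_points

def can_keep_tracking(expected_points, matched_points,
                      max_occlusion=0.8, min_points=6):
    """Hypothetical policy: keep tracking while enough points survive.

    max_occlusion and min_points are illustrative values, not settings
    taken from any real tracker.
    """
    occluded = occlusion_fraction(expected_points, matched_points)
    return occluded <= max_occlusion and matched_points >= min_points

if __name__ == "__main__":
    print(occlusion_fraction(100, 20))   # 80 of 100 expected points missing
    print(can_keep_tracking(100, 20))
    print(can_keep_tracking(100, 5))
```

Note that this counts a point as occluded whatever the cause: a passer-by, a moved product or a blown-out highlight all look the same to this measure.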
Lighting conditions indoors will hopefully be less dynamic than outdoors, where the system needs to contend with the weather, but there are still certain challenges. Flickering lights are often a problem in vision, as are environments that are simply too dark for the kinds of cameras found in mobile devices to get a low-noise exposure. Kudan’s tracker performs well with high noise, but there’s not a lot it can do with an incredibly dark, underexposed image.
Another issue is dynamic range. Indoor environments with no windows are likely to have everything fairly well exposed, but windows on bright days can make it impossible to get a good exposure without HDR. There’s also the additional problem of how the autoexposure behaves in these situations: it can cause sudden, dramatic changes in exposure, which the tracker should be robust to.
As we move towards mobile SLAM use-cases that are likely to involve longer sessions, battery life starts to become an issue. Computer vision isn’t the lightest of processing tasks and a poorly optimised implementation could easily utilise the entire CPU. Modern mobile devices are designed to process in short bursts and spend the majority of the time idling. If the CPU is used at 100% for even a short period of time, you will see your battery charge plummet.
The challenge, then, is not simply being able to reliably localise the camera, but being able to do so with minimal CPU usage. We’ve benchmarked our tracker in tracking-only mode and found that with the right combination of settings you can track reliably with as little as 15% CPU utilisation (sub-5ms tracking times) on an iPhone 7. This means the CPU can spend the remaining 85% of the time asleep. The CPU frequency scaling algorithms detect this and throttle the CPU down to a much lower clock speed, further reducing power consumption.
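The relationship between per-frame tracking time and CPU utilisation is simple arithmetic, sketched below in Python. The 30 frames per second camera rate is an assumption (the frame rate used in the benchmark is not stated here), as is the simplification that tracking is the only work done per frame on a single core.

```python
def cpu_utilisation(tracking_ms, frame_rate_hz=30.0):
    """Fraction of each frame interval spent tracking.

    frame_rate_hz defaults to an assumed 30 fps camera.
    """
    frame_budget_ms = 1000.0 / frame_rate_hz
    return tracking_ms / frame_budget_ms

def idle_fraction(tracking_ms, frame_rate_hz=30.0):
    """Fraction of each frame interval the CPU could spend asleep."""
    return 1.0 - cpu_utilisation(tracking_ms, frame_rate_hz)

if __name__ == "__main__":
    busy = cpu_utilisation(5.0)
    print(f"busy: {busy:.0%}, idle: {idle_fraction(5.0):.0%}")
```

Under that assumption, a 5 ms tracking time inside a roughly 33 ms frame interval works out to about 15% busy and 85% idle, which is consistent with the figures quoted above.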