Enhancing Digital Twins with Semantic Understanding
08.28.2024
Imagine a robot tasked with preparing a meal. To accomplish this, it first needs to navigate to the kitchen. If the robot already has an internal map representation of its environment, the concept of a kitchen must somehow be encoded within that map. The robot can also recognise it is in a kitchen by identifying the types of objects it can see, such as cupboards and kitchen appliances.
To perform its task, it will also need to recognise the specific objects it must interact with and make educated guesses about where to find them. Knives might be hidden in a drawer, and certain ingredients will likely be stored in the refrigerator. It will also need to understand object parts and how it can interact with them: the oven handle opens the door, and the dial sets the temperature.
We increasingly require the digital twin representations we build to include semantic information: an understanding of what the environment contains, what those contents could be used for, and how to use them.
Encoding Meaning into Digital Representations
Approaches to encode semantic meaning into a digital environment representation can vary in complexity. Simple representations can go a long way. For example, using object bounding boxes to identify where items are located can often provide enough information for a robot’s high-level planner to navigate and interact with its environment.
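As a minimal sketch of this idea (the map contents, class names, and helper below are illustrative, not a particular implementation), a semantic map can be as simple as a list of labelled bounding boxes that the planner queries by category:

```python
from dataclasses import dataclass

@dataclass
class SemanticBox:
    """An axis-aligned 3D bounding box tagged with an object category."""
    label: str
    min_corner: tuple[float, float, float]
    max_corner: tuple[float, float, float]

    def centre(self) -> tuple[float, float, float]:
        return tuple((lo + hi) / 2 for lo, hi in zip(self.min_corner, self.max_corner))

# A toy semantic map: labelled boxes accumulated during mapping.
semantic_map = [
    SemanticBox("refrigerator", (2.0, 0.0, 0.0), (2.8, 0.7, 1.8)),
    SemanticBox("oven",         (0.0, 0.0, 0.0), (0.6, 0.6, 0.9)),
]

def find_objects(label: str) -> list[SemanticBox]:
    """Return all mapped instances of a category for the high-level planner."""
    return [box for box in semantic_map if box.label == label]

# The planner can now pick a navigation goal near the first refrigerator found.
goal = find_objects("refrigerator")[0].centre()
print("navigate towards", goal)
```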
However, there are scenarios where a richer representation is required. In such cases, we might label individual points or voxels (volume elements) within a 3D space as belonging to a specific object category. Encoding meaning becomes even more complex when we consider that we may not always know the labelling we want ahead of time. This unpredictability calls for a flexible approach, such as relying more on recognition at run-time, or encoding only intermediate representations (features) into the map and finishing the labelling later.
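One way to keep the labelling flexible is to store an intermediate feature vector per voxel and only resolve labels when a query arrives. The sketch below is purely illustrative: the voxel features and the `embed_text` placeholder stand in for whichever model produces a shared feature space.

```python
import numpy as np

# Assume each voxel already stores a feature vector produced during mapping,
# e.g. by pooling image features projected into that voxel (model unspecified).
voxel_features = {
    (1, 4, 0): np.random.randn(512),
    (2, 4, 0): np.random.randn(512),
}

def embed_text(query: str) -> np.ndarray:
    """Placeholder for a text encoder that shares the voxel feature space."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    return rng.standard_normal(512)

def label_voxels(query: str, threshold: float = 0.2) -> list[tuple[int, int, int]]:
    """Resolve a label at query time: voxels whose features align with the text."""
    q = embed_text(query)
    q /= np.linalg.norm(q)
    matches = []
    for voxel, feat in voxel_features.items():
        similarity = float(feat @ q / np.linalg.norm(feat))
        if similarity > threshold:
            matches.append(voxel)
    return matches

print(label_voxels("cupboard"))
```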
Leveraging 3D methods for encoding and interpreting this information can be powerful, yet it often remains challenging to obtain sufficient training data for these models. Models that operate on 2D images, on the other hand, tend to be more mature and readily available. Since we often use cameras to build our maps, one powerful approach is to combine 2D methods with 3D map representations.
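A common way to combine the two is to run a 2D model on each camera frame and project its per-pixel labels onto the 3D points used to build the map. The sketch below uses a standard pinhole projection; the intrinsics and the label image are placeholders, for example the output of an off-the-shelf segmentation network.

```python
import numpy as np

# Placeholder pinhole intrinsics (fx, fy, cx, cy) and a 2D semantic label image.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
label_image = np.zeros((480, 640), dtype=np.int32)  # class id per pixel

def label_points(points_cam: np.ndarray) -> np.ndarray:
    """Assign each 3D point (in camera coordinates) the label of the pixel it projects to.

    Points behind the camera or outside the image get label -1.
    """
    labels = np.full(len(points_cam), -1, dtype=np.int32)
    z = points_cam[:, 2]
    valid = z > 1e-6
    u = np.round(fx * points_cam[:, 0] / z + cx).astype(int)
    v = np.round(fy * points_cam[:, 1] / z + cy).astype(int)
    in_image = valid & (u >= 0) & (u < 640) & (v >= 0) & (v < 480)
    labels[in_image] = label_image[v[in_image], u[in_image]]
    return labels

points = np.array([[0.1, 0.0, 2.0], [0.0, 0.2, -1.0]])  # toy map points
print(label_points(points))
```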
The Role of Language Models
Recent progress in the development of large language models (LLMs) has significantly accelerated the path toward semantic methods. Through learning to generate language, these models build word embeddings that capture the high-dimensional relationships between words. When combined with other types of data, such as images, in multi-modal models, they can effectively bridge between different kinds of data: a system's ability to recognise and understand objects and their uses can be based on both visual and linguistic cues.
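A minimal sketch of that bridging idea follows. The encoders are left as placeholders; any CLIP-style model that maps images and text into a shared embedding space would play this role.

```python
import numpy as np

def encode_image(image: np.ndarray) -> np.ndarray:
    """Placeholder for a vision encoder that shares an embedding space with text."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(512)

def encode_text(text: str) -> np.ndarray:
    """Placeholder for the matching text encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(512)

def best_label(image: np.ndarray, candidate_labels: list[str]) -> str:
    """Pick the label whose text embedding is most similar to the image embedding."""
    img = encode_image(image)
    img /= np.linalg.norm(img)
    scores = []
    for label in candidate_labels:
        txt = encode_text(label)
        scores.append(float(img @ txt / np.linalg.norm(txt)))
    return candidate_labels[int(np.argmax(scores))]

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # toy camera frame
print(best_label(frame, ["a cupboard", "an oven", "a refrigerator"]))
```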
These types of foundation models are extremely expensive to build and train. We do not want to compete with the big firms by trying to build our own. Instead, by building on these models, we put ourselves in a position to benefit from the continuous release of new and improved versions.
Enhancing Usability and Robustness with Semantics
Incorporating semantics into digital representations promises to greatly enhance interaction and usability. Humans naturally communicate through language, so the optimal interface for human-robot interaction will likely also include language. In addition, several modern approaches to high-level robot task planning, which break a task down and decide what actions to take, now incorporate LLMs. Embedding semantic information into spatial environment representations will enable systems to carry out more complex tasks and interact with their environments in a richer, more intuitive manner.
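As an illustrative sketch of how such a planner might be grounded in a semantic map (the `complete` function stands in for whichever LLM is used, and the prompt format is hypothetical), the map supplies the objects the model is allowed to plan over:

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM; returns a canned plan for illustration."""
    return "1. navigate_to(refrigerator)\n2. open(refrigerator)\n3. pick(milk)"

def plan_task(instruction: str, mapped_objects: list[str]) -> list[str]:
    """Ask the LLM for a step-by-step plan grounded in objects the map knows about."""
    prompt = (
        "You control a mobile robot. Only refer to these mapped objects: "
        + ", ".join(mapped_objects)
        + f"\nTask: {instruction}\nPlan, one action per line:"
    )
    return complete(prompt).splitlines()

steps = plan_task("fetch the milk", ["refrigerator", "oven", "cupboard", "milk"])
print(steps)
```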
Semantic understanding can also be used to boost robustness. Consider high-fidelity reconstructions such as Neural Radiance Fields (NeRF) or Gaussian Splatting: the more precisely a representation models the environment exactly as it was at capture time, the harder it becomes for other modules to recognise that the system is at a particular location once aspects of the environment have changed. Humans can recognise they are in a restaurant, for example, by identifying tables, chairs, cutlery, and glasses, even though two different restaurants may look completely different. If we can extract a higher-level semantic representation of a location, a system will be able to recognise it despite significant changes in appearance, for example, recognising at night an area that was mapped during the day.
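One simple way to realise this is to describe a place by the categories it contains rather than by its appearance, for example by comparing histograms of detected object classes. The sketch below is a toy illustration of that idea, not a particular place-recognition method.

```python
from collections import Counter

def place_signature(detected_labels: list[str]) -> Counter:
    """Describe a place by how many instances of each category were observed."""
    return Counter(detected_labels)

def signature_similarity(a: Counter, b: Counter) -> float:
    """Histogram intersection over union: 1.0 means identical category make-up."""
    categories = set(a) | set(b)
    intersection = sum(min(a[c], b[c]) for c in categories)
    union = sum(max(a[c], b[c]) for c in categories)
    return intersection / union if union else 0.0

# A daytime map of a dining area and a night-time observation of the same place:
mapped = place_signature(["table", "table", "chair", "chair", "chair", "glass"])
observed = place_signature(["table", "chair", "chair", "glass", "glass"])
print(signature_similarity(mapped, observed))  # high despite appearance changes
```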
Semantics can also improve robustness by helping the system understand what requires attention. Not all objects carry the same weight. If a system mapping an outdoor location sees a parked car, it can understand that the car is likely to move and is not worth relying on when determining its location. This prioritisation allows the system to focus on more stable and relevant features, improving its overall performance and reliability.
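A sketch of that prioritisation, assuming each map point already carries a semantic label: points belonging to classes that are likely to move are simply excluded before estimating the system's location. The class list and point data here are illustrative.

```python
import numpy as np

# Classes that are likely to move between mapping and localisation (assumed list).
DYNAMIC_CLASSES = {"car", "person", "bicycle"}

def filter_dynamic(points: np.ndarray, labels: list[str]) -> np.ndarray:
    """Keep only points whose semantic class is expected to stay put."""
    keep = np.array([label not in DYNAMIC_CLASSES for label in labels])
    return points[keep]

points = np.array([[5.0, 1.0, 0.0],   # building facade
                   [7.5, 2.0, 0.0],   # parked car
                   [9.0, 0.5, 0.0]])  # lamp post
labels = ["building", "car", "pole"]

stable_points = filter_dynamic(points, labels)
print(stable_points)  # only the building and lamp-post points remain for localisation
```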
Conclusion
At Kudan, we are developing our Semantic Digital Twin (SDT) technology to embed rich semantic information into digital environment representations. By leveraging semantic understanding, we can make the systems we are working with smarter, more efficient, and better equipped to handle real-world tasks. We are actively engaging in proof-of-concept projects with customers to explore and validate these capabilities.
■For more details, please contact us here.