A smart perception module for robotic manipulation using deep learning and foundational models

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial

Rosell Gratacòs, Jan

Zaplana Agut, Isiah

Publication date

2025-06-03

Abstract

This report presents the design, implementation, and validation of a novel framework called Vision Intelligent System for Task Awareness (VISTA). The VISTA module is a complete smart perception system built with deep learning methods and ROS 2. The framework is designed to detect, estimate, and track the pose of household objects from the Yale-CMU-Berkeley (YCB) dataset. It is considered smart because it interfaces with entities called state reasoners that interact with a knowledge base, transforming perception data into structured knowledge that external agents can use for informed decision-making in manipulation tasks. At the core of this architecture are the newly introduced state reasoners, which rapidly infer knowledge by combining raw perception data with information from the knowledge base.

The proposed methodology is structured around an object detection and segmentation algorithm based on a You Only Look Once (YOLO) model (version 11). Its output feeds into a pose estimation and tracking algorithm that uses the model-based version of FoundationPose, extended with replanning capabilities to recover from tracking failures. The FoundationPose output is then passed to the state reasoner module, which builds geometric and color knowledge of the detected objects to update the knowledge base. The methodology also includes a transfer learning pipeline for the YOLO model, adapted to detect a selected subset of YCB objects. For this purpose, a synthetic dataset was generated for training, complemented by a real dataset used for validation. Validation results show that the model generalizes well from synthetic to real images, achieving high detection confidence.

The implementation details of the framework and the integration of its components are presented in the implementation section. The assessment results presented after the validation section confirm the successful development of a fully functional perception pipeline.
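The detection-to-reasoning flow summarized above can be sketched as a minimal Python example. This is a hypothetical illustration, not code from the VISTA framework: all class and function names (`Detection`, `PoseEstimate`, `estimate_pose`, `reason`) are invented for this sketch, and the pose estimator is a stub standing in for the FoundationPose stage.

```python
from dataclasses import dataclass

# Hypothetical data types mirroring the pipeline stages described in
# the abstract (names are illustrative, not from the VISTA codebase).

@dataclass
class Detection:
    label: str          # YOLO class label, e.g. a YCB object name
    confidence: float   # detection confidence in [0, 1]

@dataclass
class PoseEstimate:
    label: str
    position: tuple     # (x, y, z) in metres, camera frame
    tracked: bool       # False when pose tracking is lost

def estimate_pose(det: Detection) -> PoseEstimate:
    """Stub for the pose estimation stage; in the real system this
    step runs model-based FoundationPose on the segmented region."""
    return PoseEstimate(det.label, (0.0, 0.0, 0.5), tracked=True)

def reason(pose: PoseEstimate, knowledge_base: dict) -> dict:
    """State-reasoner step: fuse raw perception with prior knowledge
    (e.g. colour, graspability) into a structured fact for the KB."""
    prior = knowledge_base.get(pose.label, {})
    return {
        "object": pose.label,
        "position": pose.position,
        "graspable": prior.get("graspable", False),
        "colour": prior.get("colour", "unknown"),
    }

# Example: one YCB-style object with a prior entry in the KB.
kb = {"mustard_bottle": {"graspable": True, "colour": "yellow"}}
det = Detection("mustard_bottle", confidence=0.92)
fact = reason(estimate_pose(det), kb)
print(fact)
```

In the actual system each stage would be a separate ROS 2 node exchanging messages over topics; the sketch collapses them into function calls to show only the data flow.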
Integration experiments demonstrate the benefits of using ROS 2 as the middleware, yielding a modular, flexible, and easily configurable system. Moreover, the inclusion of the knowledge base improves the robustness of the perception system by reducing misdetections and duplicate outputs, thanks to the use of ground-truth information. The combined use of FoundationPose and the YOLO detector proved highly effective, achieving accurate and sufficiently fast pose estimation at a constant throughput of 15 frames per second. Furthermore, the knowledge inferred by the state reasoners enhanced the autonomy of the robotic system used during the experiments, enabling informed decision-making for manipulation tasks such as pick-and-place.

In conclusion, this work describes the successful implementation of a smart perception system that improves both accuracy and robustness through the integration of knowledge-base information and reasoning capabilities. The inferred knowledge proves useful for improving robotic decision-making during manipulation tasks with minimal human intervention. Nonetheless, certain limitations must be considered, such as the need for a sufficiently high camera frame rate (at least 20 frames per second for both Red-Green-Blue (RGB) and depth images) and access to a powerful Graphics Processing Unit (GPU) for heavyweight deep learning inference. Without these, the performance of FoundationPose degrades significantly. Finally, the report provides recommendations for future developments to extend and refine the proposed framework.
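The misdetection and duplicate filtering that the knowledge base enables can be illustrated with a short sketch. This is a hypothetical simplification of the mechanism described above: the function name, the confidence threshold, and the representation of detections as `(label, confidence)` pairs are all assumptions made for the example.

```python
def filter_detections(detections, known_objects, min_confidence=0.5):
    """Keep at most one detection per object class, and only classes
    that the knowledge base lists as present in the scene.

    Hypothetical sketch of KB-backed filtering: ground-truth object
    identities reject misdetections, and per-class deduplication
    removes duplicate outputs for the same object.
    """
    best = {}
    for label, conf in detections:
        if label not in known_objects or conf < min_confidence:
            continue  # reject unknown classes and low-confidence hits
        if label not in best or conf > best[label]:
            best[label] = conf  # deduplicate: keep the best detection
    return sorted(best.items())

# "alien" is not in the KB and the low-confidence "banana" is dropped;
# the duplicate "mug" collapses to its highest-confidence detection.
dets = [("mug", 0.91), ("mug", 0.64), ("banana", 0.40), ("alien", 0.95)]
print(filter_detections(dets, known_objects={"mug", "banana"}))
# → [('mug', 0.91)]
```

A real implementation would also compare estimated poses before merging duplicates, since two instances of the same class can legitimately coexist in a scene.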

Document Type

Master thesis

Language

English

Publisher

Universitat Politècnica de Catalunya

Rights

Open Access
