From perception to action: implementing in-context imitation learning on a Franka robot for pick-and-place tasks

Publication date

2025-11-06T17:13:20Z

2025



Abstract

Master's thesis for: Erasmus Mundus Joint Master in Artificial Intelligence (EMAI)


Supervisor: Alessandro De Luca
Co-Supervisor: Magí Dalmau Moreno


This thesis presents a practical implementation of Instant Policy, an In-Context Imitation Learning (ICIL) model that learns new tasks rapidly from a small number of demonstrations provided at inference time. The research evaluates how modifications to the demonstration context affect the model's ability to understand and generalize manipulation behaviors, using a Franka Emika Panda arm and an Intel RealSense D435 camera integrated with Instant Policy, a state-of-the-art one-shot learning model. The core research systematically modifies demonstration buffers to analyze the model's contextual reasoning capabilities across different pick-and-place scenarios. In addition, we deploy a modular pipeline that transforms RGB-D input into structured point clouds through YOLOv11-based segmentation, enabling object identification, demonstration extraction, and model deployment at test time. To address gripper annotation challenges, we introduce an automated dataset creation methodology that combines LangSAM for text-prompt-based segmentation with XMem++ for video mask propagation. The control architecture employs Instant Policy as a Denoising Diffusion Implicit Model, generating action sequences through graph-based reasoning over point clouds and demonstration context. Experimental results demonstrate successful adaptation of pick-and-place behaviors to different demonstration contexts, with generalization across variations in object pose and background. Performance analysis reveals a critical dependence on segmentation quality, highlighting the need for robust perception in real-world deployment. This work validates the viability of ICIL for robotic pick-and-place tasks, contributing insights into context understanding, automated dataset creation, and empirical validation of ICIL performance in unstructured manipulation scenarios.
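
As an illustration of the perception step described in the abstract (turning segmented RGB-D input into a point cloud), the following is a minimal sketch of back-projecting masked depth pixels through a standard pinhole camera model. It is not the thesis implementation: the function name `masked_depth_to_points`, the toy depth image, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the boolean mask simply stands in for the output of a segmentation model.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def masked_depth_to_points(
    depth_m: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked depth pixels into camera-frame 3D points.

    Hypothetical helper: assumes depth in metres and pinhole intrinsics
    (focal lengths fx, fy and principal point cx, cy, all in pixels).
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth_m, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object mask and invalid (zero) depth.
            if not keep or z <= 0.0:
                continue
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image (metres) with made-up intrinsics, for illustration only.
depth = [[1.0, 0.0], [2.0, 2.0]]
mask = [[True, True], [False, True]]
cloud = masked_depth_to_points(depth, mask, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
print(cloud)
```

In a real setup the intrinsics would come from the camera calibration and the mask from the segmentation stage; the sketch only shows how the two combine to yield an object-level point cloud.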

Document Type

Master's final project

Language

English

Subjects and keywords

Learning

Rights

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)

https://creativecommons.org/licenses/by-nc-nd/4.0/
