Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
Universitat Politècnica de Catalunya. Intelligent Data Science and Artificial Intelligence
Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering
2024
The ability to extract valuable information from data is crucial for organizations and individuals who want to remain competitive in a constantly evolving data-driven environment. However, some of them lack the skills required to appropriately leverage existing data analytics tools and methods. This problem is aggravated when the users are domain experts who are completely unfamiliar with data analytics terminology, as existing assistant tools, such as AutoML or Intelligent Discovery Assistants, require them to state their analytical intent (i.e., the type of data analysis they want to perform). To address this problem, we propose to capture the underlying analytical intent from textual problem descriptions by leveraging Large Language Models (LLMs). To this end, we propose a hierarchical categorization of analytical intents, along with a data collection methodology to obtain analytical problem descriptions for all of them, in order to validate different approaches that aim to extract such intents from text. Next, we compare the performance of state-of-the-art approaches with that of LLMs, and then study how the performance of different LLMs varies with their characteristics and with the source of the validation data. Finally, we develop a prototype to showcase how our method could interact with existing AutoML systems.
Gerard Pons is supported by the EU’s Horizon Programme call, under Grant Agreement No. 101093164 (ExtremeXP), and Besim Bilalli is partially supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the funding scheme PID2020-117191RB-I00/AEI/10.13039/501100011033.
Peer Reviewed
Postprint (author's final draft)
Conference report
English
UPC subject areas::Computer science::Information systems; UPC subject areas::Computer science::Artificial intelligence::Natural language; Analytical intents; Data science; Large language models
Springer
https://link.springer.com/chapter/10.1007/978-3-031-70421-5_8
info:eu-repo/grantAgreement/EC/HE/101093164/EU/EXPeriment driven and user eXPerience oriented analytics for eXtremely Precise outcomes and decisions/ExtremeXP
info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-117191RB-I00/ES/DESARROLLO, OPERATIVA Y GOBERNANZA DE DATOS PARA SISTEMAS SOFTWARE BASADOS EN APRENDIZAJE AUTOMATICO/
Open Access