Assessing vector-based retrieval in elasticsearch for financial document search and analysis

Pascual García, Alfred

To access the full-text documents, please follow this link: https://hdl.handle.net/2117/455482

Author

Pascual García, Alfred

Other authors

Universitat Politècnica de Catalunya. Departament d'Estadística i Investigació Operativa

Duarte López, Ariel

Publication date

2026-01-30

Abstract

This study develops and evaluates a production-ready vector-based semantic retrieval system in Elasticsearch for financial document search, with a particular focus on Securities and Exchange Commission (SEC) filings. The work is motivated by the need for accuracy and traceability in financial analysis and is framed within a Retrieval-Augmented Generation (RAG) pipeline, where retrieval quality directly impacts the reliability of AI-assisted outputs and helps reduce hallucinations by grounding responses in authoritative documents. The system indexes document chunks as dense embeddings and compares multiple embedding models and retrieval strategies, including brute-force exact scoring (cosine similarity and dot product) and approximate nearest-neighbor search using the hierarchical navigable small world (HNSW) algorithm supported by Elasticsearch. Experiments are conducted in an Elasticsearch + Kibana environment deployed via Docker Compose and evaluated on three established financial retrieval benchmarks: SecQue Bench, FinGPT Bench, and FinDER Bench. Retrieval effectiveness is measured with Mean Reciprocal Rank (MRR), while efficiency is assessed through per-query latency to analyze speed–accuracy trade-offs across configurations. Results show that Qwen3-0.6B consistently achieves the highest retrieval effectiveness across datasets, with all-mpnet-base-v2 as the most competitive alternative. Additionally, HNSW reduces query latency relative to exact script-based scoring while maintaining very similar MRR in most configurations, indicating a favorable operational trade-off for real deployments. Overall, the project demonstrates that Elasticsearch and its capabilities can support an efficient semantic retrieval layer for financial documents, and that embedding model selection primarily sets the upper bound on retrieval quality, while approximate retrieval methods improve responsiveness with minimal loss in effectiveness.
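
As a rough illustration of the evaluation described in the abstract, the following pure-Python sketch ranks document chunks by brute-force cosine similarity and then computes Mean Reciprocal Rank over a set of queries. It is not the thesis code: the function names, the two-dimensional toy vectors, and the relevance labels are all hypothetical and chosen only to make the arithmetic easy to follow. In the study itself, scoring is performed inside Elasticsearch and the embeddings come from models such as Qwen3-0.6B.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Brute-force exact ranking: score every chunk, sort by descending similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored]


def mean_reciprocal_rank(rankings, relevant):
    """Average over queries of 1 / rank of the first relevant chunk (0 if none is found)."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for position, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant[query_id]:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical toy data: three chunk embeddings, two query embeddings.
chunks = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
queries = {"q1": [1.0, 0.1], "q2": [0.2, 1.0]}
relevant = {"q1": {"c1"}, "q2": {"c3"}}

rankings = {qid: rank_chunks(vec, chunks) for qid, vec in queries.items()}
mrr = mean_reciprocal_rank(rankings, relevant)
print(rankings, mrr)
```

In this toy setup the relevant chunk is ranked first for `q1` and second for `q2`, so the MRR is (1 + 1/2) / 2 = 0.75. An approximate method such as HNSW would replace the exhaustive loop in `rank_chunks` with a graph search, which is where the latency savings reported in the abstract come from.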

Document type

Bachelor thesis

Language

English

Subjects and keywords

Àrees temàtiques de la UPC::Economia i organització d'empreses; Àrees temàtiques de la UPC::Informàtica::Sistemes d'informació::Emmagatzematge i recuperació de la informació; Financial engineering; Information storage and retrieval systems; Enginyeria financera; Informació--Sistemes d'emmagatzematge i recuperació

Published by

Universitat Politècnica de Catalunya

Rights

Open Access

This item appears in the following collection(s)

Treballs acadèmics [82075]
