Knowledge graph inference from text

dc.contributor.author
Fuentes del Pino, Raul
dc.date.accessioned
2025-10-22T19:49:17Z
dc.date.available
2025-10-22T19:49:17Z
dc.date.issued
2025-10-20T15:34:15Z
dc.date.issued
2025
dc.identifier
http://hdl.handle.net/10230/71583
dc.identifier.uri
http://hdl.handle.net/10230/71583
dc.description.abstract
Master's thesis for the Master in Intelligent Interactive Systems
dc.description.abstract
Supervisor: Maria Inés Torres
dc.description.abstract
Knowledge graphs (KGs) play a central role in representing structured information about real-world entities and their relationships, supporting tasks such as reasoning, search, and question answering. Recent advances in natural language processing have opened the door to building KGs directly from text, making it possible to automate the extraction of entities and their semantic relations at scale. This work presents a complete pipeline for constructing a knowledge graph from natural language and using it to improve the accuracy and reliability of answers generated by large language models (LLMs). The proposed system consists of four main stages. First, entities are identified using a domain-adapted Named Entity Recognition model trained with BIO tagging on a corpus of automatically generated sentences. Then, a relation extraction component generates subject–predicate–object triples using a generative Transformer-based model fine-tuned with domain-specific data. These structured triples are encoded into dense vectors using semantic embeddings and indexed with a high-performance HNSW (Hierarchical Navigable Small World) search structure, which allows efficient retrieval of relevant facts at inference time. To evaluate the system, experiments were conducted across twelve semantic domains using the WikiDialogue dataset, focusing on entity detection accuracy and triple extraction precision. Results show that training on multiple domains significantly improves generalization, and that strict entity boundaries and span-based matching help reduce false positives. In the final stage, the indexed triples are used as input to an LLM (LLaMA 3.3-70B Instruct) in a Retrieval-Augmented Generation setup: given a user query, the system retrieves semantically similar triples and includes them in the model prompt.
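The retrieval step described above can be sketched as follows. This is a minimal illustration only: brute-force cosine similarity stands in for the HNSW index, and the triples and their 3-dimensional embeddings are hypothetical (a real system would use a sentence-embedding model and an ANN library):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, index, k=2):
    # Rank indexed triples by similarity to the query embedding and
    # return the k best (subject, predicate, object) facts.
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[1]),
                    reverse=True)
    return [triple for triple, _ in ranked[:k]]

# Hypothetical pre-computed triple embeddings (toy 3-d vectors).
index = [
    (("Paris", "capital_of", "France"), [0.9, 0.1, 0.0]),
    (("Berlin", "capital_of", "Germany"), [0.1, 0.9, 0.0]),
    (("Seine", "flows_through", "Paris"), [0.7, 0.2, 0.1]),
]

# A query embedding close to the Paris-related facts.
facts = retrieve([1.0, 0.0, 0.0], index, k=2)
```

An HNSW index serves the same purpose as the exhaustive scan here, but answers approximate nearest-neighbour queries in logarithmic rather than linear time, which is what makes retrieval practical at KG scale.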
Three prompting strategies are explored: no context (the model answers based solely on its internal knowledge), strict (the model answers only if the fact is explicitly stated in the retrieved context), and permissive (the model uses background knowledge when the context is insufficient). Examples show that strict prompts reduce hallucinations, while permissive prompts provide fluent answers but may introduce unsupported information. Overall, the results confirm that a modular KG-based pipeline can enhance LLM output by grounding it in structured knowledge. The system is scalable and adaptable to multiple domains, and the results highlight the trade-offs between precision, recall, and fluency in generation tasks. This paves the way for applying similar approaches in settings where precision and clarity are essential, such as healthcare, finance, or enterprise data systems.
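The three prompting strategies can be sketched as simple prompt templates. The instruction wording below is a hypothetical illustration, not the thesis's actual prompts:

```python
def build_prompt(question, triples, mode="strict"):
    # Assemble an LLM prompt under one of the three strategies:
    # "no_context", "strict", or "permissive".
    context = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    if mode == "no_context":
        # The model relies solely on its internal knowledge.
        return f"Answer the question.\nQuestion: {question}"
    if mode == "strict":
        # The model may only use the retrieved facts.
        return ("Answer ONLY using the facts below; if they do not "
                "contain the answer, say you don't know.\n"
                f"Facts:\n{context}\nQuestion: {question}")
    if mode == "permissive":
        # Retrieved facts are preferred, background knowledge allowed.
        return ("Prefer the facts below, but you may fall back on your "
                "own knowledge if they are insufficient.\n"
                f"Facts:\n{context}\nQuestion: {question}")
    raise ValueError(f"unknown mode: {mode}")

triples = [("Paris", "capital_of", "France")]
strict = build_prompt("What is the capital of France?", triples, "strict")
```

The strict template trades recall for faithfulness (the model refuses rather than guesses), while the permissive template trades faithfulness for fluency, which matches the hallucination behaviour reported above.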
dc.format
application/pdf
dc.language
eng
dc.rights
CC Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Language models
dc.title
Knowledge graph inference from text
dc.type
info:eu-repo/semantics/masterThesis

