Knowledge graph inference from text

dc.contributor.author
Fuentes del Pino, Raul
dc.date.accessioned
2025-10-22T19:49:17Z
dc.date.available
2025-10-22T19:49:17Z
dc.date.issued
2025-10-20T15:34:15Z
dc.date.issued
2025
dc.identifier
http://hdl.handle.net/10230/71583
dc.identifier.uri
http://hdl.handle.net/10230/71583
dc.description.abstract
Master's thesis for the Master in Intelligent Interactive Systems
dc.description.abstract
Supervisor: Maria Inés Torres
dc.description.abstract
Knowledge graphs (KGs) play a central role in representing structured information about real-world entities and their relationships, supporting tasks such as reasoning, search, and question answering. Recent advances in natural language processing have opened the door to building KGs directly from text, making it possible to automate the extraction of entities and their semantic relations at scale. This work presents a complete pipeline for constructing a knowledge graph from natural language and using it to improve the accuracy and reliability of answers generated by large language models (LLMs). The proposed system consists of four main stages. First, entities are identified using a domain-adapted Named Entity Recognition model trained with BIO tagging on a corpus of automatically generated sentences. Then, a relation extraction component generates subject–predicate–object triples using a generative Transformer-based model fine-tuned with domain-specific data. These structured triples are encoded into dense vectors using semantic embeddings and indexed with a high-performance HNSW (Hierarchical Navigable Small World) search structure, which allows efficient retrieval of relevant facts at inference time. To evaluate the system, experiments were conducted across twelve semantic domains using the WikiDialogue dataset, focusing on entity detection accuracy and triple extraction precision. Results show that training on multiple domains significantly improves generalization, and that strict entity boundaries and span-based matching help reduce false positives. In the final stage, the indexed triples are used as input to an LLM (LLaMA 3.3-70B Instruct) in a Retrieval-Augmented Generation setup: given a user query, the system retrieves semantically similar triples and includes them in the model prompt.
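The retrieval step described above can be sketched as follows. This is a minimal illustration only: brute-force cosine similarity stands in for the HNSW index, and the triples and their 3-dimensional embeddings are hypothetical (a real system would use a sentence-embedding model and an ANN library):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_vec, index, k=2):
    # Rank indexed triples by similarity to the query embedding and
    # return the k best (subject, predicate, object) facts.
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[1]),
                    reverse=True)
    return [triple for triple, _ in ranked[:k]]

# Hypothetical pre-computed triple embeddings (toy 3-d vectors).
index = [
    (("Paris", "capital_of", "France"), [0.9, 0.1, 0.0]),
    (("Berlin", "capital_of", "Germany"), [0.1, 0.9, 0.0]),
    (("Seine", "flows_through", "Paris"), [0.7, 0.2, 0.1]),
]

# A query embedding close to the Paris-related facts.
facts = retrieve([1.0, 0.0, 0.0], index, k=2)
```

An HNSW index serves the same purpose as the exhaustive scan here, but answers approximate nearest-neighbour queries in logarithmic rather than linear time, which is what makes retrieval practical at KG scale.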
Three prompting strategies are explored: no context (the model answers based solely on its internal knowledge), strict (the model answers only if the fact is explicitly stated in the retrieved context), and permissive (the model uses background knowledge when the context is insufficient). Examples show that strict prompts reduce hallucinations, while permissive prompts provide fluent answers but may introduce unsupported information. Overall, the results confirm that a modular KG-based pipeline can enhance LLM output by grounding it in structured knowledge. The system is scalable and adaptable to multiple domains, and the results highlight the trade-offs between precision, recall, and fluency in generation tasks. This paves the way for applying similar approaches in settings where precision and clarity are essential, such as healthcare, finance, or enterprise data systems.
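The three prompting strategies can be sketched as simple prompt templates. The instruction wording below is a hypothetical illustration, not the thesis's actual prompts:

```python
def build_prompt(question, triples, mode="strict"):
    # Assemble an LLM prompt under one of the three strategies:
    # "no_context", "strict", or "permissive".
    context = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    if mode == "no_context":
        # The model relies solely on its internal knowledge.
        return f"Answer the question.\nQuestion: {question}"
    if mode == "strict":
        # The model may only use the retrieved facts.
        return ("Answer ONLY using the facts below; if they do not "
                "contain the answer, say you don't know.\n"
                f"Facts:\n{context}\nQuestion: {question}")
    if mode == "permissive":
        # Retrieved facts are preferred, background knowledge allowed.
        return ("Prefer the facts below, but you may fall back on your "
                "own knowledge if they are insufficient.\n"
                f"Facts:\n{context}\nQuestion: {question}")
    raise ValueError(f"unknown mode: {mode}")

triples = [("Paris", "capital_of", "France")]
strict = build_prompt("What is the capital of France?", triples, "strict")
```

The strict template trades recall for faithfulness (the model refuses rather than guesses), while the permissive template trades faithfulness for fluency, which matches the hallucination behaviour reported above.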
dc.format
application/pdf
dc.language
eng
dc.rights
CC Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Language models
dc.title
Knowledge graph inference from text
dc.type
info:eu-repo/semantics/masterThesis

