<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-14T04:10:19Z</responseDate><request verb="GetRecord" identifier="oai:www.recercat.cat:2117/348065" metadataPrefix="oai_dc">https://recercat.cat/oai/request</request><GetRecord><record><header><identifier>oai:recercat.cat:2117/348065</identifier><datestamp>2026-03-02T05:32:57Z</datestamp><setSpec>com_2072_1033</setSpec><setSpec>col_2072_452951</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <dc:title>Clustering and topic modeling for biomedical text mining</dc:title>
   <dc:creator>Rognon, Paul Joris Denis</dc:creator>
   <dc:contributor>Universitat de Barcelona. Departament de Genètica, Microbiologia i Estadística</dc:contributor>
   <dc:contributor>Reverter Comes, Ferran</dc:contributor>
   <dc:contributor>Vegas Lozano, Esteban</dc:contributor>
   <dc:subject>UPC subject areas::Mathematics and statistics::Mathematical statistics</dc:subject>
   <dc:subject>Statistical Mathematics -- Applications</dc:subject>
   <dc:subject>Text mining</dc:subject>
   <dc:subject>Document clustering</dc:subject>
   <dc:subject>Topic modeling</dc:subject>
   <dc:subject>Word embeddings</dc:subject>
   <dc:subject>Biomedical text mining</dc:subject>
   <dc:subject>Mathematical statistics--Applications</dc:subject>
   <dc:subject>AMS Classification::62 Statistics::62P Applications</dc:subject>
   <dc:description>In this work, we study the problem of characterizing an unlabelled corpus of biomedical documents in an unsupervised manner. After a review of the literature on the subject, we propose an integrative approach to the problem. The integration is twofold. On the one hand, we use multiview learning to integrate different text representations derived from a traditional bag-of-words model, Latent Dirichlet Allocation, and a recurrent neural autoencoder. On the other hand, we integrate topic modeling outputs, clustering outputs, and biomedical word embeddings to generate an intuitive and comprehensive characterization of the corpus. We also propose a semantic graph that provides a synthetic visualization of the relationships between topics, clusters, and any other biomedical concept, based on semantic similarity. An application to the CORD-19 dataset, a collection of articles on COVID-19, shows that our methodology produces a coherent, meaningful, and informative characterization of the corpus.</dc:description>
   <dc:date>2021-06</dc:date>
   <dc:type>Master thesis</dc:type>
   <dc:identifier>https://hdl.handle.net/2117/348065</dc:identifier>
   <dc:identifier>FME-2098</dc:identifier>
   <dc:language>eng</dc:language>
   <dc:rights>Restricted access - author's decision</dc:rights>
   <dc:format>application/pdf</dc:format>
   <dc:publisher>Universitat Politècnica de Catalunya</dc:publisher>
   <dc:publisher>Universitat de Barcelona</dc:publisher>
</oai_dc:dc></metadata></record></GetRecord></OAI-PMH>