To access the full-text documents, please follow this link: http://hdl.handle.net/2117/101281

Selection of correction candidates for the normalization of Spanish user generated content
Melero, Maite; Ruiz Costa-Jussà, Marta; Lambert, Patrik; Quixal, Martí
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions; Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
We present research aiming to build tools for the normalization of User-Generated Content (UGC). We argue that processing this type of text requires revisiting the initial steps of Natural Language Processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user-generated texts) presents a number of nonstandard communicative and linguistic characteristics, often closer to oral and colloquial language than to edited text. We present a corpus of UGC text in Spanish from three different sources (Twitter, consumer reviews, and blogs) and describe its main characteristics. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging. Our aim with this paper is to harness existing spell and grammar correction engines and endow them with automatic normalization capabilities, in order to pave the way for the application of standard Natural Language Processing tools to typical UGC text. In particular, we propose a strategy for automatically normalizing UGC by adding a module on top of a pre-existing spell-checker; the module selects the most plausible correction from an unranked list of candidates provided by the spell-checker. To build this selector module we train four language models, each one containing a different type of linguistic information in a trade-off with its generalization capabilities. Our experiments show that the models trained on truecase and lowercase word forms are more discriminative than the others at selecting the best candidate. We have also experimented with a parametrized combination of the models, both by optimizing directly on the selection task and by linearly interpolating the models. The resulting parametrized combinations obtain results close to the best performing model but do not improve on those results, as measured on the test set. The precision of the selector module in ranking the expected correction first on the test corpora reaches 82.5% for Twitter text (baseline 57%) and 88% for non-Twitter text (baseline 64%).
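The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the model order, smoothing, or training data, so the add-one smoothed bigram models, the tiny corpus, the equal interpolation weights, and the names BigramLM, sentence_logprob and select_candidate are all assumptions made for illustration. The sketch scores each spell-checker candidate in its sentence context with a linear interpolation of a truecase and a lowercase model and keeps the highest-scoring one.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one smoothed bigram model (a hypothetical stand-in for a real LM)."""

    def __init__(self, sentences, lowercase=False):
        self.lowercase = lowercase
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            text = sentence.lower() if lowercase else sentence
            tokens = ["<s>"] + text.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot reserves probability mass for unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing."""
        if self.lowercase:
            prev, word = prev.lower(), word.lower()
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)


def sentence_logprob(models, weights, tokens):
    """Log-probability of a token list under a linear interpolation of the models."""
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        mixed = sum(w * m.prob(prev, word) for m, w in zip(models, weights))
        total += math.log(mixed)
    return total


def select_candidate(models, weights, tokens, position, candidates):
    """Return the candidate that maximizes the sentence score when placed at `position`."""
    def score(candidate):
        trial = tokens[:position] + [candidate] + tokens[position + 1:]
        return sentence_logprob(models, weights, trial)
    return max(candidates, key=score)


# Illustrative data only: a three-sentence "corpus" and an unranked candidate list.
corpus = ["no quiero ir", "yo quiero comer", "Quiero ir a casa"]
models = [BigramLM(corpus), BigramLM(corpus, lowercase=True)]
weights = [0.5, 0.5]  # assumed interpolation weights, not tuned

best = select_candidate(models, weights, ["yo", "kiero", "comer"], 1,
                        ["quieto", "quiero", "quemo"])
print(best)
```

In a realistic setting the weights would be tuned on held-out data, which corresponds to the parametrized combination the abstract mentions; here they are fixed by hand.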
Peer Reviewed
-UPC subject areas::Computer science::Artificial intelligence::Natural language
-Natural language processing (Computer science)
Article - Submitted version
Article
         
