Negation and speculation detection in medical and review texts
- Cruz Díaz, Noa Patricia
- Manuel Jesús Maña López (Director)
Defence university: Universidad de Huelva
Date of defence: 10 July 2014
- Manuel de Buenaga Rodríguez (Chair)
- Jacinto Mata Vázquez (Secretary)
- Mariana Lara Neves (Committee member)
Type: Thesis
Abstract
Negation and speculation detection has been an active research area in the Natural Language Processing community in recent years, including several shared tasks at relevant conferences. It constitutes a challenge from which many applications can benefit by identifying this kind of information (e.g., interaction detection, information extraction, sentiment analysis). This thesis aims to contribute to the ongoing research on negation and speculation in the Language Technology community through the development of machine-learning systems which detect speculation and negation cues and resolve their scope (i.e., identify at sentence level which tokens are affected by the cues). It focuses on the two domains in which negation and hedging have drawn the most attention: the biomedical domain and the review domain. In the first, the proposed method improves on the best results to date for the clinical sub-collection of the BioScope corpus. In the second, the novelty of the contribution lies in the fact that, to the best of our knowledge, this is the first system trained and tested on the SFU Review corpus annotated with negative and speculative information; at the same time, it is the first attempt to detect speculation in the review domain. Additionally, owing to the tokenization problems encountered during the preprocessing of the BioScope corpus and the small number of works in the literature that propose solutions to this problem, this thesis describes the issue in detail and provides a comprehensive analysis and evaluation of a set of tokenization tools. This constitutes the first comparative evaluation of tokenizers in the biomedical domain, which can help Natural Language Processing developers choose the most suitable tokenizer.
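To make the two sub-tasks concrete, the sketch below illustrates cue detection and scope resolution on a clinical-style sentence. It is a deliberately simplified, hypothetical baseline: the cue lexicon and the "scope runs from the cue to the end of the sentence" heuristic stand in for the thesis's supervised machine-learning models and are not its actual method.

```python
# Minimal sketch of the two sub-tasks the thesis addresses:
# (1) detecting negation/speculation cues and (2) resolving their scope
# at sentence level. Lexicon and heuristic are illustrative only.

NEGATION_CUES = {"no", "not", "without", "denies", "absence"}
SPECULATION_CUES = {"may", "might", "possible", "suggests", "appears"}

def detect_cues_and_scopes(tokens):
    """Return (cue_index, cue_type, scope_token_indices) triples."""
    results = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in NEGATION_CUES:
            cue_type = "negation"
        elif word in SPECULATION_CUES:
            cue_type = "speculation"
        else:
            continue
        # Baseline scope heuristic: every token after the cue up to the
        # end of the sentence is considered inside the scope.
        scope = list(range(i + 1, len(tokens)))
        results.append((i, cue_type, scope))
    return results

sentence = "The scan shows no evidence of metastasis".split()
for cue_idx, cue_type, scope in detect_cues_and_scopes(sentence):
    print(cue_type, "cue:", sentence[cue_idx],
          "| scope:", " ".join(sentence[i] for i in scope))
# negation cue: no | scope: evidence of metastasis
```

In the thesis, both steps are learned from annotated corpora rather than hard-coded; the heuristic above is the kind of simple baseline such systems are measured against.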
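The tokenization problem the abstract mentions can likewise be illustrated with a short comparison. The example sentence and the two tokenizers contrasted here are arbitrary choices made for illustration; the thesis evaluates a broader set of tools on BioScope.

```python
# Why tokenization is tricky in biomedical text: two common tokenizers
# disagree on the same sentence, which complicates aligning corpus
# annotations (e.g., BioScope's) with tokenizer output.

from nltk.tokenize import TreebankWordTokenizer  # rule-based, no model download needed

sentence = "IL-2 levels were elevated (p < 0.05) in E. coli cultures."

whitespace_tokens = sentence.split()
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(len(whitespace_tokens), whitespace_tokens)
print(len(treebank_tokens), treebank_tokens)
# The outputs differ in how parentheses and the sentence-final period
# are handled, so token counts and token boundaries no longer match.
```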