An automatic semantic role labeler for the Portuguese language

Falci, Daniel Henrique Mourão

Master's in Information Systems and Knowledge Management, FUMEC, 2018 (1.149Mb)

Date

2018

Author

Falci, Daniel Henrique Mourão

Abstract

Semantic Role Labeling (SRL) is a Natural Language Processing task that provides the means to analyze, from a semantic point of view, the information expressed through text or speech. Its purpose is to capture and represent the participants and circumstances of events or situations described at the sentence level. It is considered a major step towards natural language understanding. Most existing SRL research focuses on the English language and therefore reflects its syntactic and semantic particularities, which prevents a direct transfer of its results to other languages. For Portuguese, only a small number of studies are dedicated to the task, and none of them has achieved performance similar to that obtained for English. Moreover, to the best of our knowledge, there is only one publicly available system capable of performing automated SRL on raw text, which hampers research and holds back the innovative potential for the language. The objective of this thesis is to evaluate the performance of an automatic semantic role labeler for the Portuguese language built using techniques addressed in the literature. To achieve this goal, the first step consisted of a systematic literature review of the SRL task, intended to identify the most accurate techniques addressed in the literature. Based on its results, we developed and evaluated a semantic role labeler of raw text for the Portuguese language. Our approach is independent of syntactic parsing and relies on a deep bidirectional recurrent neural network architecture. The network predictions are used as the input of a global recursive neural parsing algorithm that was tailored to the SRL task. Our method consistently outperformed the previous state-of-the-art system for the Portuguese language on the PropBank-Br corpus by a margin of 3.05 F1-score points, reducing the relative error by 8.74%.
The model presented in this research is publicly available under a BSD license and may help future studies focused on the Portuguese language in tasks that typically depend on content analysis, ranging from Machine Translation to Question Answering systems.
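
The two figures reported in the abstract (a margin of 3.05 F1-score points and a relative error reduction of 8.74%) can be related by simple arithmetic. The sketch below is illustrative only: it assumes that "error" means 100 minus the F1-score, which the abstract does not state, and the function names are hypothetical. The baseline it derives is an implied, rounded estimate and not a figure taken from the thesis.

```python
def relative_error_reduction(f1_old: float, f1_new: float) -> float:
    """Percentage of the baseline error removed by the new system.

    Assumption (not stated in the abstract): error = 100 - F1.
    """
    return 100.0 * (f1_new - f1_old) / (100.0 - f1_old)


def implied_baseline_f1(margin: float, reduction_pct: float) -> float:
    """Baseline F1 implied by an absolute margin and a relative error reduction."""
    baseline_error = margin / (reduction_pct / 100.0)
    return 100.0 - baseline_error


# Figures quoted in the abstract.
MARGIN = 3.05          # absolute gain in F1-score points
REDUCTION_PCT = 8.74   # relative error reduction, in percent

baseline = implied_baseline_f1(MARGIN, REDUCTION_PCT)
improved = baseline + MARGIN

print(f"implied baseline F1: {baseline:.1f}")
print(f"implied new F1:      {improved:.1f}")
print(f"consistency check:   {relative_error_reduction(baseline, improved):.2f}%")
```

Because both published figures are rounded, the implied values are approximate. The exact scores should be taken from the thesis itself.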

URI

https://repositorio.fumec.br/xmlui/handle/123456789/178

Collections

Dissertations