The annotation of the C-ORAL-BRASIL spoken corpus using an adaptation of the Palavras Parser

Eckhard Bick, Heliana Mello, Alessandro Panunzi, Tommaso Raso

Research output: peer-reviewed

5 Citations (Scopus)

Abstract

This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus, using an adapted version of the Palavras parser. In order to achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon was augmented with speech-specific material, phonetically spelled abbreviations, etc. Using a two-level annotation approach, speech flow markers like overlaps, retractions and non-verbal productions were separated from running, annotatable text. In the absence of punctuation, syntactic segmentation was achieved by exploiting prosodic break markers, enhanced by a rule-based distinction between pause and break functions. Under optimal conditions, the modified parsing system achieved correctness rates (F-scores) of 98.6% for part of speech, 95% for syntactic function and 99% for lemmatization. Especially at the syntactic level, a clear connection between accessibility of prosodic break markers and annotation performance could be documented.
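The two-level approach and the break-based segmentation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or the Palavras pipeline. The marker symbols are assumptions made for illustration only: `//` as a terminal break, `/` as a non-terminal break, and tokens of the form `&xx` or `<...>` as speech flow markers. The function name `segment` is likewise hypothetical.

```python
import re

# Hypothetical marker conventions, assumed for this sketch only.
TERMINAL = "//"        # assumed: prosodic break that closes an utterance
NON_TERMINAL = "/"     # assumed: prosodic break inside an utterance
FLOW = re.compile(r"^(<.*>|&\w+)$")  # assumed: overlaps, non-verbal productions


def segment(transcript):
    """Split a whitespace-tokenized transcript into utterances.

    Returns (utterances, flow), where each utterance is a list of
    prosodic units, each unit is a list of words, and flow holds the
    speech flow markers that were kept apart from the annotatable text.
    """
    utterances, units, words, flow = [], [], [], []
    for tok in transcript.split():
        if FLOW.match(tok):
            # Second level: kept separate from the running text.
            flow.append(tok)
        elif tok in (TERMINAL, NON_TERMINAL):
            # Any break closes the current prosodic unit.
            if words:
                units.append(words)
                words = []
            # Only a terminal break closes the utterance.
            if tok == TERMINAL and units:
                utterances.append(units)
                units = []
        else:
            words.append(tok)
    # Flush material not followed by a closing break.
    if words:
        units.append(words)
    if units:
        utterances.append(units)
    return utterances, flow


utterances, flow = segment("eu acho / que sim // &he <overlap> vamos lá //")
print(utterances)
print(flow)
```

In this sketch only the terminal break ends a parsing window, while the non-terminal break merely closes a prosodic unit. The rule-based distinction between pause and break functions mentioned in the abstract is not modeled here.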
Original language: English
Title of host publication: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012
Editors: Mehmet Ugur Dogan, Joseph Mariani, Asuncion Moreno, Sara Goggi, Khalid Choukri, Nicoletta Calzolari, Jan Odijk, Thierry Declerck, Bente Maegaard, Stelios Piperidis, Helene Mazo, Olivier Hamon
Publisher: European Language Resources Association (ELRA)
Pages: 3382-3386
Number of pages: 5
ISBN (electronic): 9782951740877
Publication status: Published - 2012
Externally published: Yes
Event: 8th International Conference on Language Resources and Evaluation, LREC 2012 - Istanbul
Duration: 21 May 2012 to 27 May 2012

Publication series

Name: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

Conference

Conference: 8th International Conference on Language Resources and Evaluation, LREC 2012
Country/Territory: Turkey
City: Istanbul
Period: 21/05/12 to 27/05/12
