The C-ORAL-BRASIL I: reference corpus for informal spoken Brazilian Portuguese

Tommaso Raso*, Heliana Mello

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Citations (Scopus)

Abstract

The C-ORAL-BRASIL is a Brazilian Portuguese spontaneous speech corpus, representative of the diatopy of the state of Minas Gerais (primarily from the metropolitan area of the capital city, Belo Horizonte). The corpus was compiled following the same architecture and segmentation criteria adopted by the C-ORAL-ROM [1], and it uses the same alignment software, WinPitch [2]. The corpus comprises 139 informal speech texts totalling 208,130 words and 21:08:52 hours of recording (6.1 GB of wav files). The mean number of words per text is approximately 1,500. The recordings were made with high-resolution, non-invasive wireless equipment, generally with clip-on, monodirectional microphones and a mixer whenever there were more than two interactants; on a few occasions omnidirectional microphones were used. The texts are transcribed following the CHAT format [3], implemented for prosodic annotation [4]. The main goal of the corpus architecture is to document diaphasic and diastratic variation in Brazilian Portuguese speech.
Original language: English
Title of host publication: Computational Processing of the Portuguese Language - 10th International Conference, PROPOR 2012, Proceedings
Pages: 362-367
Number of pages: 6
Publication status: Published - 2012
Externally published: Yes
Event: 10th International Conference on Computational Processing of Portuguese, PROPOR 2012 - Coimbra, Portugal
Duration: 17 Apr 2012 to 20 Apr 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7243 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th International Conference on Computational Processing of Portuguese, PROPOR 2012
Country/Territory: Portugal
City: Coimbra
Period: 17/04/12 to 20/04/12

Keywords

  • Brazilian Portuguese
  • Corpus compilation
  • Information structure
  • PoS tagging
  • Spontaneous speech
