Replication data for: A Study on the Conceptual Structure of the Use of Prepositions in the Complement of Goal-oriented Motion Verbs in Brazilian Portuguese



This dataset contains the replication data for the study “A Study on the Conceptual Structure of the Use of Prepositions in the Complement of Goal-oriented Motion Verbs in Brazilian Portuguese”. In this study, based on a usage-feature analysis and using corpus-based and multivariate statistical methods, we analyze the use of the prepositions a ‘at’, para ‘to’, and em ‘in’ to introduce the complement of ir ‘to go’, vir ‘to come’, and chegar ‘to arrive’ in Brazilian Portuguese (BP).

The results show a tendency for a ‘at’ to be used in the most formal and monitored register. The factors ‘profiling’ and ‘verb’ are the most important language-internal predictors. Events with an action or neutral profile are more strongly associated with para ‘to’, while events with a locative profile are more strongly associated with em ‘in’. The verb chegar ‘to arrive’ is more strongly associated with em ‘in’. We highlight that (i) the variation investigated has a cognitive basis, in addition to the linguistic and extralinguistic factors pointed out by previous studies, and (ii) the variation of prepositions conveys alternative construals; thus, the very high frequency of em ‘in’ with goal-oriented motion verbs indicates nuances of meaning motivated by the superimposition of image schemas and the cognitive operation of profiling.

The dataset consists of 459 occurrences of goal-oriented motion verbs in Brazilian Portuguese (ir ‘to go’, vir ‘to come’, or chegar ‘to arrive’), manually annotated according to a set of linguistic, social, and cognitive factors.
Data were extracted from four BP corpora: (i) C-Oral-Brasil (263,000 words), which contains transcripts of spontaneous spoken language; (ii) Blogs_Foruns (263,772 words), which contains informal written language from BP forums; (iii) TecEM (234,717 words), which contains texts written by teenage students during their BP classes in high school; and (iv) Corpus Brasileiro (CB), from which only the texts classified as journalistic (250,700,829 words), drawn from Brazilian newspapers, were used.

The archive contains the data in a Unicode-encoded text file (Dataset_Motion_Verb_Prep.csv), the statistical analysis script in a text file (R_script_Motion_Verbs_Prep.txt), and a readme in a text file (00_readme.txt).

Methodological information: The four subcorpora were selected to obtain a sample of occurrences with different levels of monitoring, representing a continuum from formal written texts (newspaper/CB) to spontaneous speech (C-Oral), passing through school texts (TecEM) and informal written texts (Blogs_Foruns). The study considered only constructions with the structure “verb (ir ‘to go’, vir ‘to come’ or chegar ‘to arrive’) + {0 up to 3 words} + preposition (a ‘at’, para ‘to’ or em ‘in’) + complement”. First, a random sample of 300 occurrences was drawn from each subcorpus through a concordance search; then, tokens that did not meet the inclusion criteria were excluded (e.g., sentences in which a was an article meaning ‘the’ rather than a preposition meaning ‘at’, and sentences in which the motion verb was an auxiliary rather than the main verb). The final dataset consists of 459 tokens.
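The structure “verb + {0 up to 3 words} + preposition + complement” can be illustrated with a small token-based matcher. This is only a sketch, not the concordance query used in the study: the function name find_candidates, the short lists of inflected verb forms, and the inclusion of contracted prepositions (ao, no, pra, etc.) are all assumptions made for illustration.

```python
import re

# Illustrative, incomplete lists of inflected forms (assumption: the
# study's actual search covered the full paradigms of these verbs).
VERB_FORMS = {
    "ir": {"ir", "vou", "vai", "vamos", "vão"},
    "vir": {"vir", "vem", "venho", "veio", "vieram"},
    "chegar": {"chegar", "chego", "chega", "chegou", "cheguei", "chegaram"},
}

# Bare prepositions plus common contractions with articles (assumption:
# whether the study counted contracted forms is not stated in the record).
PREP_FORMS = {
    "a": {"a", "ao", "à", "aos", "às"},
    "para": {"para", "pra", "pro"},
    "em": {"em", "no", "na", "nos", "nas"},
}

FORM_TO_VERB = {f: lemma for lemma, forms in VERB_FORMS.items() for f in forms}
FORM_TO_PREP = {f: prep for prep, forms in PREP_FORMS.items() for f in forms}


def find_candidates(sentence, max_gap=3):
    """Return (verb, preposition, gap) for each verb form that is followed,
    after 0..max_gap intervening words, by a preposition form that itself
    has at least one following word (the complement)."""
    tokens = re.findall(r"\w+", sentence.lower())
    hits = []
    for i, tok in enumerate(tokens):
        verb = FORM_TO_VERB.get(tok)
        if verb is None:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens) - 1:  # no room for preposition + complement
                break
            prep = FORM_TO_PREP.get(tokens[j])
            if prep is not None:
                hits.append((verb, prep, gap))
                break  # keep only the first preposition after the verb
    return hits


print(find_candidates("Cheguei em casa cedo"))         # [('chegar', 'em', 0)]
print(find_candidates("Eu vou amanhã para a escola"))  # [('ir', 'para', 1)]
print(find_candidates("Ele vai comprar a casa"))       # [('ir', 'a', 1)]
```

The last example is a false positive of the kind the manual filtering step removes: vai is an auxiliary and a is an article, so a surface pattern alone cannot decide inclusion.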
Date made available: 20 Jun 2023
Date of data production: 2023

