
The British National Corpus 2014: User manual and reference guide (version 1.1)
November 2018

Contents

1 Introduction: what is the BNC2014?
2 The BNC2014 team
3 Accessing the corpus
4 The Spoken BNC2014
  4.1 Data collection: recruitment & recording
  4.2 Metadata in the Spoken BNC2014
    4.2.1 Metadata collection: procedure and ethics
    4.2.2 Text metadata: categories
    4.2.3 Text metadata: other
    4.2.4 Sub-text metadata: categories
    4.2.5 Speaker metadata: categories
    4.2.6 Speaker metadata: other
  4.3 Transcription
    4.3.1 General approach
    4.3.2 Main features of the transcription scheme
    4.3.3 Transcription procedures and quality control
    4.3.4 Speaker identification
5 The Written BNC2014
6 The BNC2014 on CQPweb
  6.1 The Spoken BNC2014 in CQPweb
  6.2 The Written BNC2014 in CQPweb
7 Encoding and markup
  7.1 Overall XML structure
  7.2 Spoken corpus document XML
  7.3 Spoken corpus header XML
  7.4 Written corpus document XML
  7.5 Written corpus header XML
  7.6 Changes made to the XML for annotation and CQPweb indexing
8 Annotation
  8.1 POS tagging
  8.2 Lemmatisation
  8.3 Semantic tagging
  8.4 XML for annotation
References
List of appendices
Appendices

1 Introduction: what is the BNC2014?

The ESRC-funded Centre for Corpus Approaches to Social Science (CASS)[1] at Lancaster University is leading the compilation of the British National Corpus 2014. This is the first publicly-accessible corpus of its kind since the original British National Corpus,[2] which was completed in 1994, and which, despite its age, is still used as a proxy for present-day English in research today. Like its predecessor, the new corpus contains examples of written and spoken British English, gathered from a range of sources. To gather the spoken component, CASS worked together with the English Language Teaching group at Cambridge University Press (CUP), compiling a new, publicly-accessible corpus of present-day spoken British English, gathered in informal contexts. This spoken component is known as the Spoken British National Corpus 2014 (Spoken BNC2014; Love et al. 2017). The new spoken corpus contains data gathered in the years 2012 to 2016. As of September 2017 it is available publicly via Lancaster University’s CQPweb server (see Hardie 2012); the underlying XML files have been downloadable from Autumn 2018 onwards. The Spoken BNC2014 contains 11,422,617 words[3] of transcribed content, featuring 668 speakers in 1,251 recordings. The Written BNC2014 is currently under development (see http://cass.lancs.ac.uk/bnc2014).

In this guide, we use the following naming conventions for the old and new British National Corpora:

- Original BNC: ‘the BNC1994’
- New BNC: ‘the BNC2014’
- Spoken components: ‘the Spoken BNC1994’ and ‘the Spoken BNC2014’
- Demographically-sampled component of the Spoken BNC1994: ‘the Spoken BNC1994DS’
- Written components: ‘the Written BNC1994’ and ‘the Written BNC2014’

[1] The research presented in this manual was supported by the ESRC Centre for Corpus Approaches to Social Science, ESRC grant reference ES/K002155/1. Additional information on CASS and its research can be found at uk/

[3] All corpus or subcorpus sizes in this document are given as counts of tokens, using the tokenisation output from the CLAWS tagger. These are the figures that can be accessed in the corpus as available via the CQPweb interface. There are two particular points of note regarding these word counts: (a) punctuation marks are counted as tokens; (b) clitics that CLAWS separates out from the bases to which they are orthographically joined, e.g., n’t, ’m, ’re, ’d, ’ve, are also counted as tokens. Other calculation methods may produce somewhat different token counts (usually lower).
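The effect of the two counting conventions noted for the word counts (punctuation as tokens, separated clitics as tokens) can be illustrated with a minimal Python sketch. This is not the CLAWS tokeniser: the regular expression, the function name `rough_tokens` and the example sentence are assumptions made purely for illustration, and real tagger output will differ in many details.

```python
import re

# Illustrative stand-in only: this is NOT the CLAWS algorithm.
# It separates the clitics named in the manual (n't, 'm, 're, 'd, 've)
# and treats every punctuation mark as a token of its own.
TOKEN_RE = re.compile(
    r"[A-Za-z]+?(?=n['’]t\b)"   # base of a negative contraction, e.g. "do" in "don't"
    r"|n['’]t\b"                # the clitic n't
    r"|['’](?:m|re|d|ve)\b"     # the clitics 'm, 're, 'd, 've
    r"|\w+"                     # any other word
    r"|[^\w\s]",                # each punctuation mark
    flags=re.IGNORECASE,
)


def rough_tokens(text: str) -> list[str]:
    """Split text into word, clitic and punctuation tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def whitespace_count(text: str) -> int:
    """Count 'words' the naive way, by splitting on whitespace."""
    return len(text.split())


SENTENCE = "I'm sure they've seen it, but we don't know."

print(whitespace_count(SENTENCE))      # 9
print(len(rough_tokens(SENTENCE)))     # 14
print(rough_tokens(SENTENCE))
```

The same sentence yields 9 items under a whitespace split but 14 under the clitic-and-punctuation scheme, which is the direction of difference the manual describes when it says other calculation methods usually give lower counts.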

2 The BNC2014 team

Robbie Love (Lancaster University) was lead researcher for the Spoken BNC2014. Abi Hawtin (Lancaster University) is lead researcher for the Written BNC2014. The team also includes (at Lancaster University) Tony McEnery, Vaclav Brezina, Andrew Hardie, Elena Semino and Matt Timperley; and (at Cambridge University Press) Claire Dembry.

In her work on the Spoken BNC2014, Claire Dembry was supported by her team at CUP, including Olivia Goodman, Imogen Dickens, Sarah Grieves and Laura Grimes, who did much of the front-line work on the project.

An extensive team at Lancaster University and elsewhere have contributed, and continue to contribute, to the Written BNC2014. A full set of credits will be included in the future version of this manual which accompanies the Written corpus’s release.

The construction of the Spoken BNC2014 was jointly funded by CASS and CUP. The construction of the Written BNC2014 is funded by CASS.

This corpus manual has been compiled from a combination of material published elsewhere in outputs describing the spoken and written corpora, and the project team’s hitherto unpublished internal technical documentation. As such, we do not consider this manual to be a citable document separate from the BNC2014 as a resource; if you wish to refer to the contents of the manual, please cite the corpus as a whole – using one or both of the canonical references specified in the respective end-user licences for the Spoken and Written components. For example: “(cf. Spoken BNC2014 (Love et al. 2017), corpus manual section 2)”.

3 Accessing the corpus

The BNC2014 is a publicly-accessible language resource, but it is not in the public domain. It remains under copyright, and use of it is subject to the terms of the User Licence. Users must agree to the licence in order to access the corpus online or to download a copy of the data. Licensing and distribution are managed by an online system accessible at http://corpora.lancs.ac.uk/bnc2014.

The Written and Spoken components have different copyright statuses, and therefore are subject to different licences. See Appendix A for the Spoken BNC2014 user licence. The Written BNC2014 user licence will be added when the corpus is released.

4 The Spoken BNC2014

4.1 Data collection: recruitment & recording

One of the most innovative features of the Spoken BNC2014 is the use of PPSR (public participation in scientific research) for data collection (see Shirk et al. 2012). Anyone interested in contributing recordings to the Spoken BNC2014 was directed to a website which described the aims of the project and included a contact form to allow them to register their interest in contributing data. People who registered interest were contacted by the CUP team via email with further instructions. The primary method of capturing public attention was a series of national media campaigns in 2014 and 2015. Using an initial two-million-word collection made by CUP in 2012, we produced lists of words which had increased (e.g. ‘awesome’) and decreased (e.g. ‘marvellous’) in frequency to the greatest extent in the new data relative to the Spoken BNC1994DS. These lists were used as the basis for research press releases, which proved very popular in the national UK press. The consequent media coverage generated the most substantial intake of new contributors.

In addition to these national media campaigns, we also participated in public engagement events such as the Cambridge University Festival of Ideas (Dembry & Love 2014) and the UK Economic and Social Research Council’s Festival of Social Sciences (Love 2015), where we shared early findings from a subset of the corpus and encouraged audiences to participate. Some supplementary targeted recruitment was conducted when the research team identified ‘holes’ in the data. Methods included use of targeted social media advertisements (e.g. targeting Facebook users from Cardiff), press releases specific to a particular social group (e.g. “Mum’s the word both then and now”) and contacting colleagues from universities in sought-after locations to ask them to spread word of the project.

While the CUP team initiated and maintained direct contact with the contributors (i.e. those who recorded conversations), they did not make any direct contact with other speakers included in the recordings. Instead, speakers received information about the project from the contributors. Contributors were therefore responsible for:

- obtaining informed consent and collecting demographic metadata from the speakers; and,
- submitting data and recordings to CUP at the end of the collection period.

Because of the importance of the contributors to the success of the project, we incentivized participation by offering payment of £18 for every hour of recording of a sufficient quality for corpus transcription, and, importantly, submission of all associated consent forms and full speaker metadata. All speakers were required to give informed consent prior to recording. To ensure that all information and consent was captured, no payments were made to contributors until all metadata, consent forms and related documentation were fully completed for each recording.

Contributors were instructed to make recordings using their smartphones. They were instructed to make recordings in MP3 format (the standard format for most smartphone recording devices), and encouraged to make their recordings in fairly quiet locations, for example household interactions or conversations in quiet cafes. However, contributors were not ‘disallowed’ from recording at any time or place, since we did not want to anticipate the production of bad recordings, and advise contributors against making them, before finding out whether they would be useable. Contributors were given no restriction on the number of speakers that could be involved in conversations, although a recommendation of two to four speakers was given. Likewise, we did not impinge more than necessary upon the spontaneity of the recording sessions by dictating features such as conversation topic, although a list of suggestions was provided (see Appendix B). Finally, it was stressed to contributors that under no circumstances could they make recordings surreptitiously, and that all speakers in the conversation must be aware that recording was taking place beforehand.

4.2 Metadata in the Spoken BNC2014

4.2.1 Metadata collection: procedure and ethics

The collection of metadata is an extremely important step in the compilation of a spoken corpus as it affords the definition of subcorpora according to different features of the speakers (e.g. age) or of the recordings themselves (e.g. number of speakers in the conversation). We henceforth refer to the former type as ‘speaker metadata’ and the latter as ‘text metadata’.

Contributors were provided with copies of the Speaker Information Sheet (Figure 1), and were instructed to have each speaker fill out a copy and return it to the contributor. Since speakers had to individually sign a consent form in any case, the speaker metadata form was incorporated into this consent form. This consent form was drafted by the team at CUP with the collaboration of the CUP legal division.

Figure 1. The speaker information sheet/consent form used for collection of the Spoken BNC2014.

The gathering of metadata directly from speakers appears to have achieved its intended goal. A comparison of the number of words which populate the ‘unknown’ groups of the main demographic categories in the Spoken BNC1994DS and in the Spoken BNC2014 (Table 1) shows a considerable improvement.

Table 1. Number of words categorised as ‘unknown’ or ‘info missing’ for the three main demographic categories in the Spoken BNC1994DS and the Spoken BNC2014.

[Table 1 columns: demographic category; group ‘unknown’/‘info missing’ (frequency and % of corpus); Spoken BNC1994DS; Spoken BNC2014. Rows: age; gender; socio-economic status. Cell values as extracted: 0, 0.00, 386,896, 38.10, 3.39.]

In line with the guarantees given in the consent form, it was necessary to anonymise the data, but to accomplish this (so far as possible) in such a way as not to affect the findings of subsequent corpus analyses. These modifications included removing “references to people or places” (Baker 2010: 49), and are described in Section 4.3.2.

The second form provided to contributors was the ‘Recording Information Sheet’ (Figure 2). This information generated text metadata for the corpus. The form also includes a table in which contributors were asked to write the first turn that each speaker spoke in the corresponding recording. The purpose of this was to aid transcription; it allowed transcribers to find an example of each speaker’s voice in the recording as identified by someone who was present for the recording and likely to be familiar with each of the speakers’ voices. We collected much more text metadata than the Spoken BNC1994 team did; the speaker and text metadata categories are summarized in the next section along with their word counts in the corpus.

Figure 2. Recording Information Sheet used in the Spoken BNC2014.

4.2.2 Text metadata: categories

This section lists all the metadata features recorded at the level of the text that can be used to classify the texts into categories. For each feature, we include a brief explanation and a screenshot of the corresponding control in the CQPweb Restricted Query interface.

TRANSCRIPTION CONVENTIONS USED

The data in the corpus from the year 2012 was gathered by CUP before the commencement of the joint project to develop the Spoken BNC2014. The recordings from this period were therefore transcribed using conventions which are different to the transcription scheme that the research team agreed for the Spoken BNC2014 (and which is described in later sections of this manual). This initial tranche of transcriptions was automatically converted into the Spoken BNC2014 XML format. While we have made every effort to ensure that the texts derived from the 2012 recordings are formatted in the same way as the rest of the corpus, we accept that there remains a possibility of minor inconsistencies of transcription practice and/or of the use of transcription conventions. Therefore, we have made it possible to restrict queries according to which version of the transcription conventions was used to create each text.

Conventions   No. texts   No. words
original      220         2,068,054
revised       1,031       9,354,563

SAMPLE RELEASE INCLUSION

In 2016, we released a 4,789,185 word sample of Spoken BNC2014 data to a small number of researchers selected via an open application process. This sample, known as the Spoken BNC2014S (where S abbreviates Sample), contained all texts from the first stage of data collection which had already been transcribed and converted into XML (see McEnery et al. 2017 for more information). These researchers were given exclusive early access to this sample via Lancaster University’s CQPweb server for the purpose of conducting research projects, as proposed in their applications. In order to facilitate further work on this subset, we have made it possible to restrict queries according to whether or not texts in the full corpus were included in the Spoken BNC2014S.

Sample release inclusion   No. texts   No. words
not in sample release      684         6,633,730
within sample release      567         4,788,887

NUMBER OF SPEAKERS

This was established by counting the number of speakers listed by the contributor on the Recording Information Sheet, and subsequently checked by automated counts of the numbers of different speaker ID codes found in each transcription (excluding any instances of codes indicating an unknown speaker).

No. of speakers   No. texts   No. words
two               622         4,881,027
[...] 8717,528

RECORDING PERIOD

This is the quarter in which recordings were gathered. Quarters are defined as 3-month periods within a given year (e.g. 2015 Q1 = January, February & March 2015).

Recording period   No. texts   No. words
2012 Q1            [...]
2012 Q2            [...]
2012 Q3            [...]
2012 Q4            [...]
2013 Q1            [...]
2013 Q2            [...]
2013 Q3            [...]
2013 Q4            [...]
2014 Q1            [...]
2014 Q2            [...]
2014 Q3            [...]
2014 Q4            [...]
2015 Q1            [...]
2015 Q2            [...]
2015 Q3            [...] 15797,787
2015 Q4            161         1,445,885
2016 Q1            181         1,520,186
2016 Q2            71          629,861
2016 Q3            1           5,077
2016 Q4            0           0

YEAR OF RECORDING

This is the year in which recordings were gathered. The years are exact supersets of the quarters.

Year   No. texts   No. words
2012   [...]
2013   [...]
2014   [...]
2015   [...]
2016   253         2,155,124

TRANSCRIBER

The Spoken BNC2014 was transcribed by a total of 20 transcribers at CUP. We have included in the text metadata an anonymised identification code for the transcriber who created each text. This facilitates the investigation of possible inter-transcriber inconsistency.

Transcriber   No. texts   No. words
T01
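The automated check described under NUMBER OF SPEAKERS, counting distinct speaker ID codes in a transcription while excluding unknown-speaker codes, might look roughly like the following Python sketch. The markup here is assumed for illustration only: the element name `u`, the attribute `who`, the `UNK` prefix for unknown-speaker codes, the sample text and both function names are hypothetical, and the corpus's actual XML is documented in the manual's section on spoken corpus document XML.

```python
import xml.etree.ElementTree as ET

# Assumption: unknown-speaker codes share a common prefix. Hypothetical value.
UNKNOWN_PREFIX = "UNK"


def count_known_speakers(xml_text: str) -> int:
    """Count distinct speaker IDs in one transcription, skipping unknown-speaker codes."""
    root = ET.fromstring(xml_text)
    speaker_ids = set()
    for utterance in root.iter("u"):          # assumed utterance element
        speaker = utterance.get("who")        # assumed speaker-ID attribute
        if speaker and not speaker.startswith(UNKNOWN_PREFIX):
            speaker_ids.add(speaker)
    return len(speaker_ids)


def matches_contributor_count(xml_text: str, listed_count: int) -> bool:
    """Compare the automated count with the number listed on the Recording Information Sheet."""
    return count_known_speakers(xml_text) == listed_count


# Invented sample transcription: two known speakers and one unknown speaker.
SAMPLE = """<text id="EXAMPLE">
  <u who="S0001">hello</u>
  <u who="S0002">hi there</u>
  <u who="UNKMALE">morning</u>
  <u who="S0001">how are you?</u>
</text>"""

print(count_known_speakers(SAMPLE))            # 2
print(matches_contributor_count(SAMPLE, 2))    # True
```

Collecting IDs in a set means a speaker is counted once however many turns they take, and texts where the automated count and the contributor's figure disagree can then be flagged for manual review.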
