Syntactic-based Collocation Extraction From Parallel .


Syntactic-based Collocation Extraction from Parallel Corpora and from the Web
Luka Nerima, Eric Wehrli, Violeta Seretan
LATL - Language Technology Laboratory
University of Geneva
n, Luka.Nerima, Eric.Wehrli}@lettres.unige.ch

Framework
- research project "Linguistic Analysis and Collocation Extraction"
  - 2002-2003
  - Geneva International Academic Network (RUIG-GIAN)
  - partner: World Trade Organization (WTO)
- main goal
  - automatic extraction of multi-word terminology from texts
  - focus on collocations
  - improve the working environment of terminologists and translators

6 May, 2004 Ecole doctorale lémanique en sciences du langage

Outline
- Introduction
  - multi-word expressions classification
  - collocations - definitions, extraction methods
- Syntactic-based collocation extraction
  - syntactic candidate filtering
  - from bigrams to arbitrarily long collocations
- Integrated system:
  - collocation extraction, visualization and validation
- Collocate discovery using Web search

Problem
According to the delegate of the European Communities, it appeared that the measures had been adopted but they would only enter into force later in January 2004.
- collocability / predictability / preference
- beyond word level: multi-word expressions
- capture relations between words
- lexicographic / automatic means

Multi-Word Expressions
- lexical or syntactical units
- prevalent in language (as many as single words)
- rough classification:
  - compound words
    - service pack, address book, all of a sudden, in front of
  - idioms
    - be up in arms, have a frog in one's throat, be a fifth wheel, entry into force
  - collocations
    - massive investment, meet requirement, schedule appointment, depend on, weapons of mass destruction, numerical system, run through, The New York Stock Exchange
- collocations vs. compounds: syntactic flexibility
- collocations vs. idioms: semantic compositionality

Collocation. Two Definitions
- Sinclair, 1991:
  - "Collocation is the occurrence of two or more words within a short space of each other in a text"
  - general, statistical approach
    - words co-occurring more often than by chance
    - "arbitrary and recurrent word combination" (Benson, 1990)
- Manning and Schütze, 1999
  - "an expression consisting of two or more words that correspond to some conventional way of saying things"
  - restrictive, linguistic approach
    - "each word has a particular and roughly stable likelihood of occurring as argument, or operator, of a given word" (Harris, 1988)

Collocation Acquisition
- lexicography
  - The BBI Dictionary of English Word Combinations (Benson et al., 1986)
  - Collins COBUILD English Language Dictionary (Sinclair, 1987)
  - Dictionnaire explicatif et combinatoire du français contemporain (Mel'cuk, 1984)
- automatic extraction
  - Sinclair 1991, Choueka et al. 1983, Church and Hanks 1990, Smadja 1993
  - statistical methods:
    - frequency counts, mobile window, independence hypothesis tests (t, χ2, log-likelihood ratios), information theoretic measures (mutual information)
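As an illustration of the information-theoretic measures listed above, the following is a minimal sketch of pointwise mutual information over adjacent word pairs. It is not taken from the slides: the function name and the choice of adjacent pairs (rather than a mobile window) are assumptions made for the example.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Score each adjacent word pair (x, y) by log2(P(x, y) / (P(x) * P(y))).

    Probabilities are plain relative frequencies in the token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy usage: the recurrent pair scores higher than an incidental one.
tokens = "prime minister met the prime minister of the country".split()
scores = pointwise_mutual_information(tokens)
print(scores[("prime", "minister")] > scores[("the", "prime")])
```

On realistic corpora, raw mutual information tends to overrate rare pairs, which is one reason hypothesis tests such as the log-likelihood ratio are often preferred for ranking.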

Collocation Extraction
1. candidate selection
  - word expressions (usually pairs) that may constitute collocations
  - usually no/very little syntactic processing
2. candidate ranking
  - order according to the collocational strength
  - based on words statistics in the corpus

Syntactic-based Collocation Extraction
- candidate selection
  - syntactic filter
    - candidate: not any pair of words, but only words in a given syntactic relation
  - collocation patterns:
    Adjective - Noun, Noun - [Pred] - Adjective, Noun - Noun, Verb - Prep, Verb - Prep - Argument, Noun - Prep - Noun, Noun - Noun, Adjective - Prep - Noun, , Subject - Verb, Verb - Object
    nuclear weapon, custom administration, rely on, act of war, share fall, provide supply
- the (filtered) candidates are passed to the statistical test
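The syntactic filter can be sketched as follows. This is an illustrative sketch only: the `Cooccurrence` record, the pattern labels, and the idea that the parser hands back pairs already tagged with a pattern name are assumptions, not the actual Fips output format.

```python
from collections import Counter
from typing import NamedTuple


class Cooccurrence(NamedTuple):
    """Hypothetical parser output: two lemmas plus the relation linking them."""
    first: str
    second: str
    pattern: str


# Pattern labels modelled on the list in the slide (naming is an assumption).
ALLOWED_PATTERNS = frozenset({
    "Adjective-Noun", "Noun-Noun", "Verb-Prep", "Noun-Prep-Noun",
    "Subject-Verb", "Verb-Object",
})


def select_candidates(cooccurrences, allowed=ALLOWED_PATTERNS):
    """Keep only pairs in an allowed syntactic relation and count them."""
    return Counter(c for c in cooccurrences if c.pattern in allowed)


parsed = [
    Cooccurrence("nuclear", "weapon", "Adjective-Noun"),
    Cooccurrence("rely", "on", "Verb-Prep"),
    Cooccurrence("the", "weapon", "Determiner-Noun"),  # filtered out
    Cooccurrence("nuclear", "weapon", "Adjective-Noun"),
]
for candidate, frequency in select_candidates(parsed).items():
    print(candidate.first, candidate.second, candidate.pattern, frequency)
```

The resulting frequency table is what a statistical test would then receive as input.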

Advantages
- no textual proximity limitation
  - A proposal for the financing of the variable costs will be made to the Committee
  - usually, pure statistical methods limit the collocate search space to a window of 5 words (combinatorial explosion)
- distinction among different readings
  - disambiguation during parsing
  - pencher (to lean) vs. se pencher sur une question (to look into an issue)
- afford morpho-syntactic variation
  - words inflection - base word form (lemmatization)
  - inversion - canonical position
  - extraposition - passivization, relativization, topicalization
    - at a cost_i of 5 billion that_i is chiefly being met e_i by South Korea and Japan

Multi-Word Collocation Discovery
- most concern is for bigrams (word pairs)
  - statistical relevance measures - appropriate for pairs of items only
- many collocations longer than two lexical items
  - round of presidential election, provide a steady supply, financial service supplier, join the euro system, abolish death penalty
- identify chains of bigrams that share common terms
  - round of election
  - presidential election
    - shared word: election
    - multi-word collocation candidate: round of presidential election
- iteratively linking bigrams - arbitrarily long collocation (candidates)
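The iterative linking of bigrams through shared words can be sketched as below. This is a simplified illustration under stated assumptions: bigrams are reduced to pairs of content lemmas, a candidate is an unordered set of lemmas, and the `max_words` cap is introduced only to keep the example bounded. Recovering the surface order ("round of presidential election") would need the parse, which is not modelled here.

```python
def grow_chains(bigrams, max_words=4):
    """Iteratively merge bigrams that share a word into longer candidates.

    Each bigram is a pair of lemmas. Returns the set of candidates with
    more than two words, each represented as a frozenset of lemmas.
    """
    pairs = [frozenset(b) for b in bigrams]
    chains = set(pairs)
    frontier = set(pairs)
    while frontier:
        new_chains = set()
        for chain in frontier:
            for pair in pairs:
                # Link only when the pair shares a word and adds a new one.
                if chain & pair and not pair <= chain:
                    merged = chain | pair
                    if len(merged) <= max_words and merged not in chains:
                        new_chains.add(merged)
        chains |= new_chains
        frontier = new_chains
    return {chain for chain in chains if len(chain) > 2}


bigrams = [("round", "election"), ("presidential", "election"), ("prime", "minister")]
for candidate in grow_chains(bigrams):
    print(sorted(candidate))
```

In this toy input, "round election" and "presidential election" share "election" and yield one three-word candidate, while "prime minister" stays a bigram.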

The System
- Fips parser (Laenzlinger & Wehrli 1991)
  - based on an adaptation of Chomsky's Principles and Parameters theory
  - for English, French
  - FipsCo subsystem - syntactic co-occurrences (bigrams)
- log-likelihood ratios test (Dunning 1993)
  - collocation score
  - ranks bigrams
- visualization tools:
  - concordance
  - alignment in parallel corpora (alignment method)
- collocations validation
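For reference, the log-likelihood ratio statistic for a candidate pair can be computed from a 2x2 contingency table as in the sketch below. This is a generic formulation of the test, not the FipsCo code; the function and argument names are assumptions. Here `k11` counts the pair itself, `k12` the first word with another second word, `k21` another first word with the second word, and `k22` all remaining pairs.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of co-occurrence counts."""

    def weighted_log(*counts):
        # Sum of k * ln(k / total), skipping empty cells (0 * ln 0 = 0).
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    cells = weighted_log(k11, k12, k21, k22)
    rows = weighted_log(k11 + k12, k21 + k22)
    cols = weighted_log(k11 + k21, k12 + k22)
    return 2.0 * (cells - rows - cols)


# Independent words give a score of 0; perfectly associated words score high.
print(log_likelihood_ratio(10, 10, 10, 10))
print(log_likelihood_ratio(10, 0, 0, 10))
```

Sorting candidate bigrams by this score in decreasing order gives a ranking of the kind described above.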

System Architecture
[Figure: system architecture diagram. Recoverable labels: text corpora (French, English); file selection; collocation extraction; Fips parser; FipsCo; syntactic co-occurrences extraction; collocation score; validated collocations database]

Experimental Results

Corpus          Size                       Processing time   Processing speed            Bigrams extracted                  Tri-grams extracted
The Economist   6.20 Mb, 879'013 words     7'158 s           0.88 Kb/s, 121.5 words/s    161'293 total, 106'713 distinct    58'398 total, 55'351 distinct
Le Monde        8.88 Mb, 1'471'270 words   7'936.2 s         1.14 Kb/s, 185.4 words/s    276'932 total, 182'298 distinct    119'852 total, 113'150 distinct

Bigrams
The Economist      Le Monde
prime minister     milliard de franc
last year          million de franc
mass destruction   premier fois
interest rate      milliard de dollar
next year          premier ministre
chief executive    Assemblée national
bin laden          Union soviétique
poor country       million de dollar
central bank       affaire étranger
see as             fonction public

Tri-grams
The Economist                Le Monde
weapon of mass destruction   ministre de affaire étranger
have impact on               Front du salut national
go out of                    ministre de éducation national
pull out of                  tribunal en grande instance
make difference to           président de conseil général
rise in to                   membre de comité central
move from to                 membre de bureau politique
rise from in                 réaliser chiffre de affaire
play role in                 franc de chiffre de affaire
have interest in             chiffre de affaire de milliard

Top 10 bigrams ordered by the log-likelihood score, and the 10 most frequent tri-grams extracted

Demo

Collocate Discovery using Web Search. Frequency Counts
- Web corpus: availability, coverage, search tools
- comparing hits number

  collocate             hits*
  (not preserved)       245,000
  largely available     3,690

- frequency suggests the collocate

* Google search engine, 4 May 2004

Syntactic Approach
- simple frequency counts - noisy
  - same context by chance (e.g., headings, not the same sentence)
  - unwanted category
  - no inflection
- aim: perform syntactic analysis of snippets
  - extract only co-occurrences:
    - syntactically-bound
    - the desired collocate category (co-occurrence type, e.g. Adjective - Noun)
  - afford morpho-syntactic variation

Method
- perform the search with the base word only as query
- build corpus of Web instances (search result snippets)
- syntactic analysis of sentences containing the base word
- extract co-occurrences of given type(s)
- apply statistical collocation test
- show the ordered list of co-occurrences and display context (sentence link)
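The steps above can be sketched as a small pipeline. Everything named here is hypothetical: the snippets are passed in as plain strings instead of being fetched from a search service, `toy_extractor` is a crude stand-in for the syntactic analysis (it simply returns the word preceding the base word), and raw frequency replaces the statistical collocation test.

```python
import re
from collections import Counter


def split_sentences(snippet):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", snippet) if s.strip()]


def toy_extractor(sentence, base_word):
    """Stand-in for a parser: return the word right before the base word."""
    words = re.findall(r"\w+", sentence.lower())
    return [words[i - 1] for i, w in enumerate(words) if w == base_word and i > 0]


def discover_collocates(base_word, snippets, extract=toy_extractor):
    """Return (collocate, count, example sentence) tuples, most frequent first."""
    counts = Counter()
    contexts = {}
    pattern = re.compile(rf"\b{re.escape(base_word)}\b", re.IGNORECASE)
    for snippet in snippets:
        for sentence in split_sentences(snippet):
            if not pattern.search(sentence):
                continue  # keep only sentences containing the base word
            for collocate in extract(sentence, base_word):
                counts[collocate] += 1
                contexts.setdefault(collocate, sentence)
    return [(c, n, contexts[c]) for c, n in counts.most_common()]


snippets = [
    "Ancient civilization flourished here. Nothing else.",
    "They studied ancient civilization. Modern civilization differs.",
]
for collocate, count, context in discover_collocates("civilization", snippets):
    print(collocate, count, context)
```

Passing the extractor as a parameter keeps the pipeline independent of any particular parser, so a real syntactic analysis could be substituted without changing the surrounding steps.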

Details
- Google search engine
  - highest number of indexed pages
    - (3 billions - 4.28 billions)
  - API access to search service
- advanced search parameters
  - retrieve only pages in a given language (homography)
- Observations:
  - French and English only (parser's languages)
  - limited access to Google results (key - 1'000 queries/day, only first 1'000 snippets)
  - time expensive:
    - mainly server search time and downloading
    - pre-processing (sentence boundaries)
    - parsing

Results. Evaluation Methods
- number of different bigrams and processing time for different results strata (average for 20 base words)
- interesting results even for few snippets (200)
- evaluation:
  - against BBI dictionary
  - students solving cloze exercises without/by using the tool

Example. Discussions
- collocates for "civilization"
- comparison with the BBI dictionary entry
- improvements:
  - inflection for the base word
  - page source (directory category, page ranking)
  - speed up the results retrieval

Future Work
- corpus-driven investigation of collocations syntactic and semantic idiosyncrasy
- discovery of syntactic collocation types
  - extracting generic relations (specification, complementation, )
  - compiling a list of interesting syntactic patterns
- evaluation of extraction
  - recall measure: #collocations identified / #collocations in corpus
  - annotated resources (annotation tool)
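The recall measure mentioned above amounts to the following computation, shown as a minimal sketch. It assumes a set of annotated (gold) collocations for the corpus is available; the function name and the toy data are illustrative only.

```python
def recall(identified, gold):
    """Share of the gold collocations in the corpus that were identified."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(set(identified) & gold) / len(gold)


identified = {"prime minister", "interest rate", "see as"}
gold = {"prime minister", "interest rate", "central bank", "chief executive"}
print(recall(identified, gold))  # 2 of the 4 gold collocations were found
```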

References
Benson, M., Benson, E., and Ilson, R. 1986. The BBI Dictionary of English Word Combinations. Amsterdam: John Benjamins.
Benson, M. 1990. Collocations and general-purpose dictionaries. International Journal of Lexicography, 3(1):23-35.
Choueka, Y., Klein, S. T., and Neuwitz, E. 1983. Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal of the Association for Literary and Linguistic Computing, 4(1):34-38.
Church, K. and Hanks, P. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.
Dunning, T. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.
Harris, Z. S. 1988. Language and Information. New York: Columbia University Press.
Laenzlinger, C. and Wehrli, E. 1991. Fips, un analyseur interactif pour le français. TA informations, 32(2):35-49.
Manning, C. and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press.
Mel'cuk, I. A. et al. 1984, 1988, 1992, 1999. Dictionnaire explicatif et combinatoire du français contemporain: Recherches lexico-sémantiques I, II, III, IV. Montréal: Presses de l'Université de Montréal.
Sinclair, J. 1987. Collins-Cobuild English language dictionary. Ed. by J. Sinclair. London: Collins. (CCELD).
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Smadja, F. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.
