Lexical Tools ASCII Conversion


Lexical Tools: ASCII Conversion
Dr. Chris J. Lu
The Lexical Systems Group
NLM. LHNCBC. CGSB
March 2011

Table of Contents
- Introduction
- ASCII conversion
  - Character
  - Document
  - Corpus
  - Software/APIs
- Example
- Questions

ASCII Character Set
- ASCII: American Standard Code for Information Interchange
- Contains 128 7-bit coded characters
- Value range: U+0000 to U+007F
- Includes:
  - alphabetic characters: A, B, C, etc.
  - numeric characters: 0, 1, 2, 3, etc.
  - control characters: ESC, FS, CR, etc.
  - graphic characters: #, $, %, &, *, (, ), etc.
- The most commonly used standard code (before Unicode)
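The value range above can be checked directly from a character's code point. The following is a minimal sketch; the function names are illustrative and are not part of the Lexical Tools.

```python
def is_ascii_char(ch):
    """Return True if the single character lies in U+0000..U+007F."""
    return ord(ch) <= 0x7F


def non_ascii_chars(text):
    """List the characters in text that fall outside the ASCII range."""
    return [ch for ch in text if not is_ascii_char(ch)]


if __name__ == "__main__":
    # "Caf\u00e9" is "Café" written with an escape for the e-acute.
    print(non_ascii_chars("Caf\u00e9"))
```

A text is pure ASCII exactly when `non_ascii_chars` returns an empty list.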

Unicode
- A character encoding specification published by the Unicode Consortium
- Covers all of the world's major writing systems
- Has become the industry standard
- Allows data to be transported between different systems
- Very useful for multilingual NLP
- Latest version: Unicode 6.0.0, 2011

Unicode Transformation Format (UTF)
- Unicode encodings include UTF-7, UTF-8, UTF-16, and UTF-32
- UTF-8 has become the dominant character encoding:
  - Backward compatible with ASCII
  - Avoids the complications of endianness
  - Needs no byte order mark (BOM)
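The UTF-8 properties listed above can be observed with Python's built-in codecs. This is only an illustrative sketch; the helper name `encoding_report` is an assumption made for this example.

```python
import codecs


def encoding_report(text):
    """Summarize how text is encoded in UTF-8 and UTF-16."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16")  # Python's "utf-16" codec prepends a BOM
    return {
        "characters": len(text),
        "utf8_bytes": len(utf8),
        "utf8_has_bom": utf8.startswith(codecs.BOM_UTF8),
        "utf16_has_bom": utf16.startswith(codecs.BOM_UTF16),
        # Only pure ASCII text has identical ASCII and UTF-8 byte sequences.
        "same_as_ascii": text.isascii() and utf8 == text.encode("ascii"),
    }


if __name__ == "__main__":
    print(encoding_report("Deja Vu"))
    print(encoding_report("D\u00e9j\u00e0 Vu"))
```

Pure ASCII text yields the same bytes in ASCII and UTF-8, while each accented letter in "Déjà Vu" takes two bytes in UTF-8.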

Lexicon & Lexical Tools
- Released in UTF-8 format since 2006
- Provide functions to convert UTF-8 to ASCII:
  - Character
  - Text
  - Document

Why ASCII Conversion?
- Non-ASCII Unicode characters are common even in English documents, for example “Déjà Vu”, “Café”, and “resumé”
- Some NLP projects still handle only ASCII

The Challenges
- The mapping is not one-to-one:
  - Many to one: å, â, ã, á, à, ä to a
  - One to many: © to ![COPYRIGHT SIGN]!, to (c), or simply removed
  - One to none: the French borrowing “divorcé” means a man who is divorced. This word has no pure ASCII spelling variant in Webster's Dictionary, while the converted ASCII word, “divorce”, is a different, closely related word
- Misused Unicode characters (before the conversion):
  - μ (mu, U+03BC) and µ (micro sign, U+00B5)
  - ß (sharp s, U+00DF) and β (beta, U+03B2)
  - ¶ (pilcrow sign, U+00B6) and π (pi, U+03C0)
- Wrong conversions (meaning changed):
  - © to (c): copyright or cellular phone number?
  - divorcé to divorce
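The look-alike pairs above are distinct code points, which is why they are easy to misuse. A short sketch using Python's standard `unicodedata` module makes the difference visible; the helper name `describe` is an assumption made for this example.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


# Look-alike pairs, written as escapes so the code points are unambiguous.
LOOK_ALIKES = [
    ("\u03bc", "\u00b5"),  # Greek mu vs. micro sign
    ("\u00df", "\u03b2"),  # sharp s vs. Greek beta
    ("\u00b6", "\u03c0"),  # pilcrow sign vs. Greek pi
]

if __name__ == "__main__":
    for left, right in LOOK_ALIKES:
        print(describe(left), "|", describe(right))
```

Any conversion step that keys on one member of a pair will miss the other, so a converter has to decide explicitly how to treat both.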

Conversion Guidelines
- Preserve the semantic and/or graphic representation
- Example: ™
  - Graphic: TM
  - Semantic: ![TRADE MARK SIGN]!
  - Graphic and semantic: (TM) or (tm)
  - NLP: empty string (treated as a stopword)
- Different NLP applications may apply different methods because their requirements and objectives differ
- There is no single best method for ASCII conversion
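Because the appropriate output depends on the application, one way to express the guideline is a per-application strategy table. The sketch below is hypothetical; the strategy names and the function are assumptions made for this example and do not come from the Lexical Tools.

```python
TRADE_MARK = "\u2122"  # the trade mark sign

# Hypothetical strategy names, mirroring the four options listed above.
STRATEGIES = {
    "graphic": "TM",
    "semantic": "![TRADE MARK SIGN]!",
    "graphic_and_semantic": "(TM)",
    "nlp": "",  # dropped, i.e. treated like a stopword
}


def convert_trade_mark(text, strategy):
    """Replace the trade mark sign according to the chosen strategy."""
    return text.replace(TRADE_MARK, STRATEGIES[strategy])


if __name__ == "__main__":
    for name in STRATEGIES:
        print(name, "->", repr(convert_trade_mark("Acme\u2122 widget", name)))
```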

Character Conversion
- Strip diacritics: å, â, ã, á, à, ä, ê, é, è, ë, î, í, ì, ï, ô, õ, ó, ø, ò, ö, û, ú, ù, ü, ý, ç, ñ, etc.
- Split ligatures: Æ, æ, Œ, œ, ﬀ, ﬂ, ﬃ, etc.
- Punctuation mapping: “double quotation”, ‘single quotation’, –, -, etc.
- Symbols mapping: ©, ®, ™, etc.
- Combinations: ǽ [U+01FD], ǅ [U+01C5], ¾ [U+00BE], etc.
- Others: α, β, etc.
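Two of these steps, stripping diacritics and splitting ligatures, can be approximated with Unicode normalization. The sketch below is not the Lexical Tools implementation: it uses NFKD decomposition from Python's `unicodedata` plus a small hand-written table (an assumption for this example) for characters such as æ and ø that have no Unicode decomposition.

```python
import unicodedata

# Hypothetical supplementary table for letters without a Unicode decomposition.
EXTRA_MAP = {
    "\u00e6": "ae", "\u00c6": "AE",  # ae ligature
    "\u0153": "oe", "\u0152": "OE",  # oe ligature
    "\u00f8": "o", "\u00d8": "O",    # o with stroke
}


def to_ascii_sketch(text):
    """Strip diacritics and split ligatures; other characters pass through."""
    pieces = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the combining marks left by decomposition
        pieces.append(EXTRA_MAP.get(ch, ch))
    return "".join(pieces)


if __name__ == "__main__":
    print(to_ascii_sketch("D\u00e9j\u00e0 Vu"))
    print(to_ascii_sketch("sp\u00e6lsau"))
```

Characters outside these two categories, such as Greek letters or symbols, pass through unchanged here, which is why separate punctuation and symbol mappings are still needed.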

Lexical Tools
Unicode-related functions (flow components):

LVG Flow                             | Description                   | Input (UTF-8)  | Output (ASCII)
-f:q                                 | Strips diacritics             | Déjà Vu        | Deja Vu
-f:q0                                | Symbols & punctuation         | “Quote”        | "Quote"
-f:q1                                | Unicode mapping               | ⅔              | 2/3
-f:q2                                | Splits ligatures              | spælsau        | spaelsau
-f:q3                                | Unicode names                 | ©              | ![COPYRIGHT SIGN]!
-f:q4                                | Unicode synonym               | μ (mu, U+03BC) | µ (micro sign, U+00B5)
-f:q5 (-f:q7:q3)                     | Normalize Unicode             | UMLS®          | UMLS![REGISTERED SIGN]!
-f:q6 (-f:q4:q7:q3)                  | Normalize Unicode w/ synonyms | UMLS®          | UMLS![REGISTERED SIGN]!
-f:q7 (recursive -f:q0:q1:q2:q)      | Core norm                     | Ǣ              | AE
-f:q8                                | Strip or map (not ICU)        | Zadaxin®       | Zadaxin
-f:q8                                | Strip or map (not ICU)        | α              | alpha
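The "Unicode names" output shown for -f:q3 wraps a character's official Unicode name in `![` and `]!`. The sketch below imitates that output format with Python's `unicodedata`; it is only an approximation of the notation and not the LVG implementation, and the function name is an assumption made for this example.

```python
import unicodedata


def unicode_name_markup(text):
    """Replace each non-ASCII character with ![UNICODE NAME]!."""
    pieces = []
    for ch in text:
        if ord(ch) <= 0x7F:
            pieces.append(ch)
        else:
            # The fallback name is used for characters without an official name.
            pieces.append("![%s]!" % unicodedata.name(ch, "UNKNOWN"))
    return "".join(pieces)


if __name__ == "__main__":
    print(unicode_name_markup("UMLS\u00ae"))
```

Because Unicode names consist only of ASCII characters, the result of this function is always pure ASCII.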

Lexical Tools (Cont.)
Pure ASCII conversion:

LVG Flow(s)  | Description                   | Pure ASCII | Outputs
-f:q5        | Normalize Unicode             | Yes        | Single
-f:q6        | Normalize Unicode w/ synonyms | Yes        | Single
-f:N, -f:N3  | Normalize Unicode             |            |
-f:q7:q8     | Serial flows                  | Yes        | Single
ToAscii      | ASCII conversion              | Yes        | Single

Text Conversion
- There are many different ways to do ASCII conversion
- The SPECIALIST Lexical Tools:
  - Provide various powerful functions
  - Are configurable according to the specifications
- Use ToAscii
- Flow: Free Text (Unicode) -> Lexical Tools (ToAscii) -> Free Text (ASCII)

Corpus Conversion
- Flow: Corpus (Unicode) -> ToAscii -> Algorithm from domain experts -> Corpus (ASCII)

Corpus Conversion - Lexicon
- Flow: Lexicon (Unicode) -> Conversion Algorithm -> Lexicon (ASCII)
- Conversion algorithm:
  - ToAscii
  - Delete if it is new
  - Delete if it is duplicated
  - Delete if it has a different meaning

Delete: If New
- Delete the conversion if it is new (not known to the Lexicon)
- Theoretically, the ASCII Lexicon is a subset of the Unicode Lexicon, since ASCII is a subset of Unicode
- All converted bases should be known to (contained in) the Lexicon
- Example - “Müthing” [E0573093]: the record is deleted (“Muthing” is not known to the Lexicon)

Original record:
    {base=Müthing
    entry=E0573093
        cat=noun
        variants=reg
        variants=uncount
        proper
    }

Converted record (deleted):
    {base=Muthing
    entry=E0573093
        cat=noun
        variants=reg
        variants=uncount
        proper
    }

Delete: If Duplicated
- Delete the conversion if it is a duplicate
- Example - resume [E0053099]: the spelling variants are removed because their conversions duplicate the base

Original record:
    {base=resume
    spelling_variant=résumé
    spelling_variant=resumé
    entry=E0053099
        cat=noun
        variants=reg
    }

Converted record (both spelling variants are then removed):
    {base=resume
    spelling_variant=resume
    spelling_variant=resume
    entry=E0053099
        cat=noun
        variants=reg
    }

Delete: If Meaning Changed
- Delete the conversion if it has a different meaning
- Example - mu [E0041164]: the spelling variant “μm” is deleted because its ASCII conversion, “mum” [E0041369], is a different record

Original record:
    {base=mu
    spelling_variant=μ
    spelling_variant=μm
    entry=E0041164
        cat=noun
        variants=inv
        variants=metareg
        abbreviation_of=micrometer|E0040123
    }

Converted record:
    {base=mu
    spelling_variant=mu
    spelling_variant=mum
    entry=E0041164
        cat=noun
        variants=inv
        variants=metareg
        abbreviation_of=micrometer|E0040123
    }

Existing record for “mum”:
    {base=mum
    entry=E0041369
        cat=noun
        variants=reg
    }
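The three deletion rules can be sketched as a filter over the spelling variants of one record. This is a simplified illustration and not the actual Lexicon conversion code: the Lexicon is modeled as a hypothetical dictionary from ASCII term to the set of entry IDs that own it, and `toy_to_ascii` is a stand-in for ToAscii that only strips diacritics and maps the Greek letter mu.

```python
import unicodedata


def toy_to_ascii(text):
    """Stand-in for ToAscii: map Greek mu to "mu" and strip diacritics."""
    text = text.replace("\u03bc", "mu")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def convert_variants(entry_id, base, variants, to_ascii, known_terms):
    """Return (kept terms, {dropped variant: reason}) for one record."""
    kept = [base]
    dropped = {}
    for variant in variants:
        converted = to_ascii(variant)
        owners = known_terms.get(converted)
        if owners is None:
            dropped[variant] = "new"              # not known to the Lexicon
        elif converted in kept:
            dropped[variant] = "duplicated"       # same as base or a kept variant
        elif entry_id not in owners:
            dropped[variant] = "meaning changed"  # belongs to another record
        else:
            kept.append(converted)
    return kept, dropped


# Toy Lexicon built from the entry IDs used in the examples above.
KNOWN = {"resume": {"E0053099"}, "mu": {"E0041164"}, "mum": {"E0041369"}}

if __name__ == "__main__":
    print(convert_variants("E0041164", "mu", ["\u03bc", "\u03bcm"],
                           toy_to_ascii, KNOWN))
```

With the toy Lexicon, the variant μ converts to the base itself and is dropped as a duplicate, while μm converts to "mum", which belongs to a different entry.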

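NLP Software Conversion
- NLP Software/APIs (Unicode): algorithm and Unicode data; the results from the APIs are in Unicode
- ASCII NLP Project: software components with data out (ASCII) and data in for further processing
- The Unicode results cannot be fed directly into the ASCII project; there are two ways to bridge the gap:
  - Traditional approach
  - Interface approach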

Traditional Approach
- ASCII NLP Project: software components with data out (ASCII) and data in for further processing
- NLP Software/APIs (Unicode): the algorithm is kept, and the Unicode data is replaced by ASCII data
- The results from the APIs are in ASCII
- This traditional approach is tedious and not practical

Interface Approach
- ASCII NLP Project: software components with data out (ASCII) and data in for further processing
- NLP Software/APIs (Unicode): algorithm and Unicode data are left unchanged; the results from the APIs are in Unicode
- An interface converts the results before they reach the ASCII project:
  - ToAscii
  - Remove unknown conversions
  - Remove duplicated conversions
- The interface approach is easy and generic
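The interface approach amounts to wrapping a Unicode API so that its results are post-processed before the ASCII project sees them. The sketch below is a hypothetical illustration: `unicode_api`, `is_known`, and `strip_marks` are stand-ins invented for this example, not Lexical Tools functions.

```python
import unicodedata


def strip_marks(text):
    """Stand-in for ToAscii: strip diacritics only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_interface(unicode_api, to_ascii, is_known):
    """Wrap a Unicode API so that it returns filtered ASCII results."""
    def wrapped(query):
        results = []
        for item in unicode_api(query):
            converted = to_ascii(item)
            if not is_known(converted):
                continue  # remove unknown conversions
            if converted in results:
                continue  # remove duplicated conversions
            results.append(converted)
        return results
    return wrapped


if __name__ == "__main__":
    def fake_api(query):
        # Pretend Unicode API returning a fixed list of terms.
        return ["resume", "r\u00e9sum\u00e9", "M\u00fcthing"]

    api = ascii_interface(fake_api, strip_marks, lambda term: term in {"resume"})
    print(api("any query"))
```

The wrapped API and its Unicode data stay untouched; only the thin conversion layer is specific to the ASCII project.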

Application Example
- ASCII NLP project: MetaMap (software component with data out in ASCII and data in for further processing)
- Traditional approach: Lexical Tools APIs (Unicode) use the algorithm with ASCII data (Db tables) and return ASCII results
- Interface approach: Lexical Tools API (Unicode) uses the algorithm with Unicode data; the results from Lexical Tools then go through ToAscii, removal of unknown conversions, and removal of duplicated conversions
- Both approaches produced identical results over 0.5M test cases for the 2010 release

References
- Unicode Consortium - http://www.unicode.org
- ICU (International Components for Unicode) - http://site.icu-project.org
- Lexical Tools Unicode Documents
- Lu, Chris J.; Browne, Allen C.; Divita, Guy, "Using Lexical Tools to Convert Unicode Characters to ASCII", Proceedings of AMIA 2008 Annual Symposium, Nov. 8-12, 2008, Washington DC, p. 1031
- Lu, Chris J. and Browne, Allen C., "Converting Unicode Lexicon and Lexical Tools for ASCII NLP", submitted for publication in Proceedings of AMIA 2011 Annual Symposium, Oct. 22-26, 2011, Washington DC

Questions
- Lexical Systems Group: http://umlslex.nlm.nih.gov
- The SPECIALIST NLP Tools: http://specialist.nlm.nih.gov
