Lexical Tools ASCII Conversion

Dr. Chris J. Lu
The Lexical Systems Group, NLM, LHNCBC, CGSB
March 2011

Table of Contents
- Introduction
- ASCII conversion
  - Character
  - Document
  - Corpus
  - Software/APIs
- Example
- Questions

ASCII Character Set
- ASCII: American Standard Code for Information Interchange
- Contains 128 characters, each coded in 7 bits
- Value range: U+0000 to U+007F
- Includes:
  - alphabetic characters: A, B, C, ...
  - numeric characters: 0, 1, 2, 3, ...
  - control characters: ESC, FS, CR, ...
  - graphic characters: #, $, %, &, *, (, ), ...
- The most commonly used standard code before Unicode
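The value range above gives a direct test for whether a string needs conversion at all. The following is a minimal sketch; the function names are illustrative and are not part of the Lexical Tools.

```python
def is_ascii(text: str) -> bool:
    """Return True when every code point lies in U+0000..U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)


def non_ascii_chars(text: str) -> list[str]:
    """Return the characters outside the ASCII range, in order of appearance."""
    return [ch for ch in text if ord(ch) > 0x7F]


if __name__ == "__main__":
    sample = "D\u00e9j\u00e0 Vu"  # "Déjà Vu"
    print(is_ascii(sample), non_ascii_chars(sample))
```

A check like this can be used to skip strings that are already pure ASCII before any conversion step runs.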

Unicode
- A character encoding specification published by the Unicode Consortium
- Includes all of the world's major writing systems
- Has become the industry standard
- Allows data to be transported between different systems
- Very useful for multilingual NLP
- Latest version (as of 2011): Unicode 6.0.0

Unicode Transformation Format
- Unicode encodings include UTF-7, UTF-8, UTF-16, and UTF-32
- UTF-8 has become the dominant character encoding:
  - Backward-compatible with ASCII
  - Avoids the complications of endianness
  - No need for a byte order mark (BOM)
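The UTF-8 properties listed above can be observed with Python's built-in codecs. This is a small sketch for illustration; the helper names are assumptions made for this example.

```python
import codecs


def utf8_length(text: str) -> int:
    """Number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))


def is_utf8_identical_to_ascii(text: str) -> bool:
    """True when the UTF-8 bytes equal the ASCII bytes (pure-ASCII input only)."""
    try:
        return text.encode("utf-8") == text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains a character that ASCII cannot represent.
        return False


def starts_with_utf16_bom(data: bytes) -> bool:
    """True when the byte string begins with a UTF-16 byte order mark."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


if __name__ == "__main__":
    print(is_utf8_identical_to_ascii("Cafe"))          # pure ASCII text
    print(utf8_length("Caf\u00e9"))                    # "Café" needs an extra byte
    print(starts_with_utf16_bom("a".encode("utf-16")))
```

Python's generic "utf-16" codec prepends a BOM because the byte order is otherwise ambiguous, whereas the "utf-8" codec writes none.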

Lexicon & Lexical Tools
- Released in UTF-8 format since 2006
- Provide functions to convert UTF-8 to ASCII for:
  - Character
  - Text
  - Document

Why ASCII Conversion?
- Non-ASCII Unicode characters are common even in English documents, for example “Déjà Vu”, “Café”, and “resumé”
- Some NLP projects still handle only ASCII

The Challenges
- The mapping is not one-to-one:
  - Many to one: å, â, ã, á, à, ä to a
  - One to many: © to ![COPYRIGHT SIGN]!, (c), or simply removed
  - One to none: the French borrowing “divorcé” means a man who is divorced. This word has no pure ASCII spelling variant in Webster's Dictionary, while the converted ASCII word, “divorce”, is a different, closely related word
- Misused Unicode characters (before the conversion):
  - μ (mu, U+03BC) and µ (micro sign, U+00B5)
  - ß (sharp s, U+00DF) and β (beta, U+03B2)
  - ¶ (pilcrow sign, U+00B6) and π (pi, U+03C0)
- Wrong conversions (meaning changed):
  - © to (c): copyright or cellular phone number?
  - divorcé to divorce
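The look-alike pairs listed above are easy to confuse on screen but are distinct code points. A minimal sketch using Python's standard `unicodedata` module makes the difference visible; the `describe` helper and the pair list are illustrative only.

```python
import unicodedata

# Look-alike pairs named on the slide, written as escapes to avoid ambiguity.
LOOKALIKE_PAIRS = [
    ("\u03bc", "\u00b5"),  # Greek mu vs. micro sign
    ("\u00df", "\u03b2"),  # sharp s vs. Greek beta
    ("\u00b6", "\u03c0"),  # pilcrow sign vs. Greek pi
]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for left, right in LOOKALIKE_PAIRS:
        print(f"{describe(left):40} | {describe(right)}")
```

Printing names before conversion is one way to spot a misused character, since the converted ASCII text no longer shows which code point the author typed.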

Conversion Guidelines
- Preserve the semantic and/or graphic representation
- Example: ™
  - Graphic: TM
  - Semantic: ![TRADE MARK SIGN]!
  - Graphic and semantic: (TM) or (tm)
  - NLP: empty string (treated as a stopword)
- Different NLP applications may apply different methods because their requirements and objectives differ
- There is no single best method for ASCII conversion

Character Conversion
- Strip diacritics: å, â, ã, á, à, ä, ê, é, è, ë, î, í, ì, ï, ô, õ, ó, ø, ò, ö, û, ú, ù, ü, ý, ç, ñ, etc.
- Split ligatures: Æ, æ, Œ, œ, ﬀ, ﬂ, ﬃ, etc.
- Punctuation mapping: “double quotation”, ‘single quotation’, –, -, etc.
- Symbols mapping: symbols such as ©, ®, ™, etc.
- Combinations: ǽ [U+01FD], ǅ [U+01C5], ¾ [U+00BE], etc.
- Others: α, β, etc.
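The first two operations above can be sketched with Unicode normalization plus small lookup tables. This is an illustration under stated assumptions, not the Lexical Tools implementation: the tables `LIGATURES` and `NO_DECOMPOSITION` are hypothetical and deliberately tiny.

```python
import unicodedata

# Hypothetical, deliberately small tables for illustration only.
LIGATURES = {
    "\u00c6": "AE", "\u00e6": "ae",   # Æ, æ
    "\u0152": "OE", "\u0153": "oe",   # Œ, œ
    "\ufb00": "ff", "\ufb02": "fl", "\ufb03": "ffi",
}
# Letters such as ø have no canonical decomposition, so they need a table.
NO_DECOMPOSITION = {"\u00f8": "o", "\u00d8": "O"}


def strip_diacritics(text: str) -> str:
    """Decompose (NFD), drop combining marks, then map undecomposable letters."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return "".join(NO_DECOMPOSITION.get(ch, ch) for ch in stripped)


def split_ligatures(text: str) -> str:
    """Replace each known ligature with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(strip_diacritics("D\u00e9j\u00e0 Vu"))
    print(split_ligatures("sp\u00e6lsau"))
    # A combination such as U+01E2 needs both steps, diacritic first.
    print(split_ligatures(strip_diacritics("\u01e2")))
```

Combination characters need more than one step, and the order matters: applying `split_ligatures` first leaves U+01E2 untouched because it is not in the ligature table.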

Lexical Tools
Unicode-related functions (flow components):

| LVG flow | Description | Input (UTF-8) | Output (ASCII) |
|---|---|---|---|
| -f:q | Strip diacritics | Déjà Vu | Deja Vu |
| -f:q0 | Symbols & punctuation | “Quote” | "Quote" |
| -f:q1 | Unicode mapping | ⅔ | 2/3 |
| -f:q2 | Split ligatures | spælsau | spaelsau |
| -f:q3 | Unicode names | © | ![COPYRIGHT SIGN]! |
| -f:q4 | Unicode synonym | μ (mu, U+03BC) | µ (micro sign, U+00B5) |
| -f:q5 (-f:q7:q3) | Normalize Unicode | UMLS® | UMLS![REGISTERED SIGN]! |
| -f:q6 (-f:q4:q7:q3) | Normalize Unicode with synonyms | UMLS® | UMLS![REGISTERED SIGN]! |
| -f:q7 (recursive -f:q0:q1:q2:q) | Core norm | Ǣ | AE |
| -f:q8 | Strip or map (not ICU) | Zadaxin | Zadaxin |
| -f:q8 | Strip or map (not ICU) | α | alpha |
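The "Unicode names" output format shown for -f:q3 can be imitated with Python's `unicodedata` module. This sketch only reproduces the `![NAME]!` notation seen in the examples; it is not the Lexical Tools code, and the function names are assumptions.

```python
import unicodedata


def unicode_name_tag(ch: str) -> str:
    """Return ASCII characters unchanged; wrap the Unicode name of others."""
    if ord(ch) <= 0x7F:
        return ch
    # The default "UNKNOWN" is an assumption for unnamed code points.
    return f"![{unicodedata.name(ch, 'UNKNOWN')}]!"


def names_for_non_ascii(text: str) -> str:
    """Replace every non-ASCII character in text with its tagged Unicode name."""
    return "".join(unicode_name_tag(ch) for ch in text)


if __name__ == "__main__":
    print(names_for_non_ascii("UMLS\u00ae"))
```

Because every character has at most one official name, this kind of mapping preserves the semantics of a symbol at the cost of readability.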

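Lexical Tools (Cont.)
Pure ASCII conversion:

| LVG flow(s) | Description | Pure ASCII | Outputs |
|---|---|---|---|
| -f:q5 | Normalize Unicode | Yes | Single |
| -f:q6 | | Yes | Single |
| -f:N, -f:N3 | Normalize Unicode | | |
| -f:q7:q8 | Serial flows | Yes | Single |
| ToAscii | ASCII conversion | Yes | Single |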

Text Conversion
- There are many different ways to convert text to ASCII
- The SPECIALIST Lexical Tools:
  - Provide various powerful functions
  - Are configurable according to the specifications
- Use ToAscii

Free Text (Unicode) -> Lexical Tools (ToAscii) -> Free Text (ASCII)

Corpus Conversion

Corpus (Unicode) -> ToAscii + algorithm from domain experts -> Corpus (ASCII)

Corpus Conversion - Lexicon

Lexicon (Unicode) -> conversion algorithm -> Lexicon (ASCII)

Conversion algorithm:
- ToAscii
- Delete if it is new
- Delete if it is duplicated
- Delete if it has a different meaning

Delete: If New
- Delete the conversion if it is new (not known to the Lexicon)
- In theory, the ASCII Lexicon is a subset of the Unicode Lexicon, since ASCII is a subset of Unicode
- All converted bases should therefore be known to (contained in) the Lexicon
- Example: “Müthing” [E0573093]. The converted record is deleted because “Muthing” is not known to the Lexicon

Unicode record:
{base=Müthing
entry=E0573093
    cat=noun
    variants=reg
    variants=uncount
    proper
}

Converted record (deleted):
{base=Muthing
entry=E0573093
    cat=noun
    variants=reg
    variants=uncount
    proper
}

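Delete: If Duplicated
- Delete the conversion if it is a duplicate
- Example: resume [E0053099]. Both spelling variants convert to “resume”, which duplicates the base, so they are removed

Unicode record:
{base=resume
spelling_variant=résumé
spelling_variant=resumé
entry=E0053099
    cat=noun
    variants=reg
}

Converted record (the duplicated spelling variants are then removed):
{base=resume
spelling_variant=resume
spelling_variant=resume
entry=E0053099
    cat=noun
    variants=reg
}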

Delete: If Meaning Changed
- Delete the conversion if it has a different meaning
- Example: mu [E0041164]. The spelling variant “μm” is deleted because its ASCII conversion, “mum” [E0041369], is a different record

Unicode record:
{base=mu
spelling_variant=μ
spelling_variant=μm
entry=E0041164
    cat=noun
    variants=inv
    variants=metareg
    abbreviation_of=micrometer|E0040123
}

Converted record (before deletion):
{base=mu
spelling_variant=mu
spelling_variant=mum
entry=E0041164
    cat=noun
    variants=inv
    variants=metareg
    abbreviation_of=micrometer|E0040123
}

Existing record that already owns “mum”:
{base=mum
entry=E0041369
    cat=noun
    variants=reg
}
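The three deletion rules can be expressed as one classification step per converted term. The sketch below is an assumption-laden illustration: `classify_conversion` and the index structure (ASCII term mapped to the entry ID that owns it) are hypothetical, and only the entry IDs come from the examples above.

```python
def classify_conversion(
    ascii_form: str,
    entry_id: str,
    already_present: set[str],
    lexicon_index: dict[str, str],
) -> str:
    """Decide what to do with one converted term of a lexical record.

    lexicon_index maps each known ASCII term to the entry ID that owns it.
    already_present holds the ASCII terms the converted record already has.
    """
    owner = lexicon_index.get(ascii_form)
    if owner is None:
        return "new"              # not known to the Lexicon: delete
    if owner != entry_id:
        return "meaning changed"  # belongs to a different record: delete
    if ascii_form in already_present:
        return "duplicate"        # repeats a term of the same record: delete
    return "keep"


if __name__ == "__main__":
    index = {"mu": "E0041164", "mum": "E0041369", "resume": "E0053099"}
    print(classify_conversion("Muthing", "E0573093", set(), index))
    print(classify_conversion("resume", "E0053099", {"resume"}, index))
    print(classify_conversion("mum", "E0041164", {"mu"}, index))
```

Checking ownership before duplication means a term that collides with another record is always reported as a change of meaning, which mirrors the “mum” example.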

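NLP Software Conversion
- NLP Software/APIs (Unicode):
  - Algorithm
  - Unicode data
- Results from APIs (Unicode)
- ASCII NLP Project, software components:
  - Data out (ASCII)
  - Data in for further processing
- Unicode results cannot be passed directly to the ASCII project (marked X on the slide). Two approaches:
  - Traditional approach
  - Interface approach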

Traditional Approach
- ASCII NLP Project, software components:
  - Data out (ASCII)
  - Data in for further processing
- NLP Software/APIs (Unicode):
  - Algorithm
  - Unicode data
  - ASCII data
- Results from APIs (ASCII)
- The traditional approach is tedious and impractical

Interface Approach
- ASCII NLP Project, software components:
  - Data out (ASCII)
  - Data in for further processing
- NLP Software/APIs (Unicode):
  - Algorithm
  - Unicode data
- Results from APIs (Unicode) pass through an interface:
  - ToAscii
  - Remove unknown conversions
  - Remove duplicated conversions
- The interface approach is easy and generic
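The interface steps listed above amount to a thin filter placed between a Unicode API and its ASCII consumer. The following sketch assumes a caller-supplied conversion function; `simple_to_ascii` is a hypothetical stand-in based on diacritic stripping, not the ToAscii function of the Lexical Tools.

```python
import unicodedata
from typing import Callable, Iterable


def simple_to_ascii(text: str) -> str:
    """Hypothetical stand-in for ToAscii: strips combining marks only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def ascii_interface(
    unicode_results: Iterable[str],
    to_ascii: Callable[[str], str],
    known_ascii_terms: set[str],
) -> list[str]:
    """Convert API results, dropping unknown and duplicated conversions."""
    seen: set[str] = set()
    ascii_results: list[str] = []
    for term in unicode_results:
        converted = to_ascii(term)
        if converted not in known_ascii_terms:
            continue  # remove unknown conversions
        if converted in seen:
            continue  # remove duplicated conversions
        seen.add(converted)
        ascii_results.append(converted)
    return ascii_results


if __name__ == "__main__":
    results = ["r\u00e9sum\u00e9", "resume", "M\u00fcthing"]
    print(ascii_interface(results, simple_to_ascii, {"resume"}))
```

Keeping the conversion function as a parameter leaves the Unicode software and its data untouched, which is the property that makes this approach generic.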

Application Example

Traditional approach:
- Lexical Tools APIs (Unicode):
  - Algorithm
  - ASCII data (Db tables)
- Results from APIs (ASCII)
- ASCII NLP Project (MetaMap), software component:
  - Data out (ASCII)
  - Data in for further processing

Interface approach:
- Lexical Tools API (Unicode):
  - Algorithm
  - Unicode data
- Results from Lexical Tools pass through an interface:
  - ToAscii
  - Remove unknown conversions
  - Remove duplicated conversions

Both approaches gave identical results on more than 0.5M test cases for the 2010 release.

References
- Unicode Consortium: http://www.unicode.org
- ICU (International Components for Unicode): http://site.icu-project.org
- Lexical Tools Unicode Documents
- Lu, Chris J.; Browne, Allen C.; Divita, Guy, "Using Lexical Tools to Convert Unicode Characters to ASCII", Proceedings of the AMIA 2008 Annual Symposium, Nov. 8-12, 2008, Washington DC, p. 1031
- Lu, Chris J. and Browne, Allen C., "Converting Unicode Lexicon and Lexical Tools for ASCII NLP", submitted for publication in Proceedings of the AMIA 2011 Annual Symposium, Oct. 22-26, 2011, Washington DC

Questions
- Lexical Systems Group: http://umlslex.nlm.nih.gov
- The SPECIALIST NLP Tools: http://specialist.nlm.nih.gov
