Writing Systems, Transliteration And Decipherment

Upload by : Evelyn Loftin
Transcription

Writing Systems, Transliteration and Decipherment

Kevin Knight
University of Southern California, Information Sciences Institute

Richard Sproat
Oregon Health & Science University, Center for Spoken Language Understanding

Overview
• An overview of writing systems
• Transcription/transliteration between scripts
• Traditional and automatic approaches to decipherment

Knight/Sproat: Writing Systems, Transliteration and Decipherment

Part I: Writing Systems and Encodings

Some terminology
• A script is a set of symbols
• A writing system is a script paired with a language.

What could writing systems represent?
• In principle any linguistic level
  “My dog likes avocados”

What do writing systems actually represent?
• Phonological information:
  – Segmental systems: alphabets, abjads, alphasyllabaries
  – Syllables (but full syllabaries are rare)
• Words in partially logographic systems
• Some semantic information:
  – Ancient writing systems like Sumerian, Egyptian, Chinese, Mayan
• But no full writing system gets by without some representation of sound

Roadmap
• Look at how Chinese writing works: Chinese is the only “ancient” writing system in current use, and in many ways it represents how all writing systems used to operate.
• Detour slightly into “semantic-only” or “logographic” writing.
• Survey a range of options for phonological encoding

The “six writings”
• xiàngxíng, simple pictograms: ‘person’, ‘wood’, ‘turtle’
• zhǐshì, indicators: ‘above’, ‘below’
• huìyì, meaning compounds: ‘bright’ (SUN + MOON)
• xíngshēng, phonetic compounds: ‘oak’ (TREE + xiàng), ‘duck’ (BIRD + jiǎ)
• zhuǎnzhù, ‘redirected characters’: ‘trust’ (PERSON + WORD)
• jiǎjiè, ‘false borrowings’ (rebuses): ‘come’ (from an old pictograph for ‘wheat’)

Xíngshēng characters
95% of Chinese characters ever invented consist of a semantic and a phonetic component



A generalization of huìyì: Japanese kokuji (国字)

Japanese logography
• Japanese writing has three subsystems
  – Two kana syllabaries, which we’ll look at later
  – Chinese characters (kanji), which usually have two kinds of readings:
    – Sino-Japanese (on ‘sound’) readings: often a given character will have several of these
    – Native Japanese (kunyomi) readings
• Examples:
  ‘mountain’: on: san, kun: yama
  ‘island’: on: too, kun: shima
  鯉 ‘carp’: on: ri, kun: koi

A generalization of xíngshēng: Vietnamese Chữ Nôm

Semantic-phonetic constructions in other ancient scripts
[Figure: examples from Egyptian, Sumerian and Mayan; surviving glosses: [DIV] Nin Gal, Urim [LOC] ma]

Syllabaries
• Syllables are often considered more “natural” representations in contrast to phonemes. E.g.:
  – “investigations of language use suggest that many speakers do not divide words into phonological segments unless they have received explicit instruction in such segmentation comparable to that involved in teaching an alphabetic writing system” [Faber, Alice. 1992. “Phonemic segmentation as epiphenomenon: evidence from the history of alphabetic writing.” In Pamela Downing, Susan Lima, and Michael Noonan, eds, The Linguistics of Literacy. John Benjamins, Amsterdam, pages 111–34.]
• Syllabaries have been invented many times (true); the alphabet was only invented once (not so clearly true)
• But: very few systems exist that have a separate symbol for every syllable of the language:
  – Most are defective or at least partly segmental

Linear B (ca 1600-1100 BC)
Derived from an earlier script (Linear A), which was used to write an unknown language (Minoan)

Linear B

Cherokee (1821)
Sequoyah (George Gist) (1767 - 1843)
u-no-hli-s-di

Kana

Yi

Segmental writing
• Somewhere around 3000 BC, the Egyptians developed a mixed writing system whose phonographic component was essentially consonantal – hence segmental
• One hypothesis as to why they did this is that Egyptian – like distantly related Semitic – had a root-and-pattern type morphology.
  – Vowel changes indicated morphosyntactic differences; the consonantal root remained constant
  – Thus a spelling that reflected only consonants would have a constant appearance across related words

Egyptian consonantal symbols

Proto-Sinaitic (aka Proto-Canaanite) script
• Somewhere around 2000 BC, Semitic speakers living in Sinai, apparently influenced by Egyptian, simplified the system and devised a consonantal alphabet
• This was a completely consonantal system:
  – No matres lectionis – using consonantal symbols to represent long vowels – as in later Semitic scripts
• Phoenician (and other Semitic scripts) evolved from this script

Proto-Sinaitic script

Later Semitic scripts: vowel diacritics

The evolution of Greek writing
• Greek developed from Phoenician
• Vowel symbols developed by reinterpreting – or maybe misinterpreting – Phoenician consonant symbols
• The alphabet is often described as only having been invented once.
  – But that’s not really true: the Brahmi and Ethiopic alphasyllabaries developed apparently independently, from Semitic

Alphasyllabaries: Brahmi (ca 5th century BC)

Some Brahmi-derived scripts

Basic design of Brahmi-derived alphasyllabaries
• Every consonant has an inherent vowel
  – This may be canceled by an explicit cancellation sign (virama in Devanagari, pulli in Tamil)
  – Or replaced by an explicit vowel diacritic
• In many scripts consonant groups are written with some consonants subordinate to or ligatured with others
• In most scripts of India vowels have separate full and diacritic forms:
  – Diacritic forms are written after consonants
  – Full forms are written syllable or word initially
  – In most Southeast Asian scripts (Thai, Lao, Khmer) this method is replaced by one where all vowels are diacritic, and syllables with open onsets have a special empty onset sign. (We will see this method used again in another script.)

Devanagari vowels
[Figure: Devanagari vowel chart, with the inherent vowel marked]
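The design described above (inherent vowel, cancellation sign, vowel diacritics, separate full vowel forms) can be sketched with a handful of Devanagari code points. This is only an illustration: the inventory is deliberately tiny, and the function names akshara and initial_vowel are made up for this sketch rather than taken from the slides.

```python
# Hypothetical, tiny inventory: just enough letters to show the design.
CONSONANTS = {"k": "\u0915", "t": "\u0924", "r": "\u0930"}    # KA, TA, RA
VOWEL_SIGNS = {"aa": "\u093E", "i": "\u093F", "u": "\u0941"}  # diacritic forms
FULL_VOWELS = {"a": "\u0905", "aa": "\u0906", "i": "\u0907", "u": "\u0909"}
VIRAMA = "\u094D"                                             # cancellation sign


def akshara(consonant, vowel):
    """Spell one consonant plus an optional vowel, in logical order."""
    base = CONSONANTS[consonant]
    if vowel is None:      # bare consonant: cancel the inherent vowel
        return base + VIRAMA
    if vowel == "a":       # inherent vowel: nothing extra is written
        return base
    # An explicit diacritic replaces the inherent vowel. It is stored after
    # the consonant even when it is drawn to its left (as the sign for i is).
    return base + VOWEL_SIGNS[vowel]


def initial_vowel(vowel):
    """Syllable-initial or word-initial vowels use the full form."""
    return FULL_VOWELS[vowel]


print(akshara("k", "a"))                       # ka
print(akshara("k", "i"))                       # ki
print(akshara("k", None) + akshara("t", "a"))  # kta, usually drawn as a conjunct
print(initial_vowel("i") + akshara("t", "a"))  # ita
```

The virama doubles as the trigger for conjunct rendering, which is why the cluster kta is stored as three code points (KA, virama, TA) even though a font may draw it as one ligature.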

Kannada diacritic vowels

Another alphasyllabary: Ethiopic (Ge’ez) (4th century AD)

“Correct sounds for instructing the people” (훈민정음)
The origin of Korean Writing

“The speech of our country differs from that of China, and the Chinese characters do not match it well. So the simple folk, if they want to communicate, often cannot do so. This has saddened me, and thus I have created twenty eight letters. I wish that people should learn the letters so that they can conveniently use them every day.”
King Sejong the Great (Chosun Dynasty, 1446)

Hangul symbols

Design principles of Hangul
• For consonants based on the position of articulation
  – ㄱ “k” looks like the tongue root closing the throat
• Vowels made use of the basic elements “earth” (horizontal line) and “humankind” (vertical line)
  – ㅜ “u” as in the middle sound of “jun”.

Design principles of Hangul

Summary
• Writing systems represent language in a variety of different ways
• But all writing systems represent sound to some degree
• While syllabaries are indeed common, virtually all syllabaries require some analysis below the syllable level

Encodings: Unicode
• Character encodings are arranged into “planes”
  – A plane consists of 65,536 (10000₁₆) “code points”
  – There are 17 planes (0-16) with Plane 0 being the “Basic Multilingual Plane”
• Texts are encoded in “logical” order, which is more abstract than the presentation order
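The plane arithmetic above can be checked directly. The sketch below assumes only Python's built-in ord; the helper names plane_of and offset_in_plane are illustrative.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
NUM_PLANES = 17        # planes 0 through 16


def plane_of(ch):
    """Return the number of the plane that holds this character."""
    return ord(ch) // PLANE_SIZE


def offset_in_plane(ch):
    """Return the character's position inside its plane."""
    return ord(ch) % PLANE_SIZE


for ch in ["A", "\u0915", "\U00013000"]:   # Latin, Devanagari, Egyptian hieroglyph
    print(f"U+{ord(ch):04X}: plane {plane_of(ch)}, offset {offset_in_plane(ch):04X}")

# 17 planes of 65,536 code points give the full range U+0000..U+10FFFF.
print(hex(NUM_PLANES * PLANE_SIZE - 1))
```

Most scripts in everyday use live in Plane 0; historic scripts such as Egyptian hieroglyphs were assigned to Plane 1.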

Types of code points

Example: Devanagari Code Points

Example of Logical Ordering: Tamil /hoo/
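The slide's figure did not survive, so the following is a reconstruction under an assumption: that the example is the syllable HA plus the vowel OO. In Tamil this vowel is drawn partly to the left and partly to the right of the consonant, yet the consonant comes first in the encoded text. Unicode also accepts a two-sign spelling of the same vowel, and canonical normalization (here via Python's standard unicodedata module) treats the two spellings as equivalent.

```python
import unicodedata

HA = "\u0BB9"        # TAMIL LETTER HA
SIGN_OO = "\u0BCB"   # TAMIL VOWEL SIGN OO, drawn on both sides of the consonant
SIGN_EE = "\u0BC7"   # TAMIL VOWEL SIGN EE, drawn to the left
SIGN_AA = "\u0BBE"   # TAMIL VOWEL SIGN AA, drawn to the right

one_sign = HA + SIGN_OO              # logical order: consonant, then vowel
two_signs = HA + SIGN_EE + SIGN_AA   # alternative spelling of the same syllable


def same_text(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print([unicodedata.name(c) for c in one_sign])
print(one_sign == two_signs)           # raw code point sequences differ
print(same_text(one_sign, two_signs))  # canonically equivalent
```

Comparing or indexing text without normalizing first would treat the two spellings as different strings, which is one practical cost of allowing encoding variants.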

UTF-8
• Common encoding of Unicode.
  – Variable length depending upon which code points one is dealing with
  – Programming languages have libraries that make dealing with UTF-8 strings easy.
  – Makes it easy to mix-and-match text from various sources: Knight/Sproat, մայրաքաղաք, …

Bidirectional text
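The variable-length property of UTF-8 noted above is easy to observe. A small sketch using only Python's built-in str.encode; the sample characters are chosen for illustration, except the Armenian word, which comes from the slide.

```python
samples = {
    "K": "Latin",
    "\u0574": "Armenian",
    "\u0915": "Devanagari",
    "\U00013000": "Egyptian hieroglyph (outside Plane 0)",
}


def utf8_length(ch):
    """Number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


for ch, script in samples.items():
    print(f"U+{ord(ch):04X} {script}: {utf8_length(ch)} byte(s)")

# Text from different scripts can sit in one string and round-trip unchanged.
mixed = "Knight/Sproat, մայրաքաղաք"
encoded = mixed.encode("utf-8")
print(len(mixed), "characters,", len(encoded), "bytes")
print(encoded.decode("utf-8") == mixed)
```

Because character count and byte count differ, code that slices UTF-8 data should operate on decoded strings rather than raw bytes.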

Unicode encoding schemes

Issues with Unicode
• The design principles are nice, but they are inconsistently applied:
  – In Brahmi-derived alphasyllabaries each consonant and vowel has a separate code point.
  – Not so in Ethiopic
• In Indian alphasyllabaries, logical order is strictly enforced
  – Not so in Thai and Lao
• As we saw in the Tamil example, Unicode allows for variants for encoding the same information
• The term ideograph should never have become enshrined as the term for Chinese characters

Part II: Transcription (Transliteration)

When Languages Collide
At the border crossing (before writing):
  W UH T Z  Y ER  N EY M ?
  AA KH M EH DH
  AE K M EH D ?
  N OW, AA KH M EH DH
  OW K EY, W IY L  JH UH S T  K AH L  Y UW  AE K M EH D
Phonemic transfer: two spoken forms

When Languages Collide
At the border crossing (after writing):
  I need to type your name.
  Here’s my passport.
  What’s this say? It’s a bunch of squiggly lines.
  AA KH M EH DH
  AE K M EH D ?
  Argh. Fine. Ackmed.
Textual transfer: two written forms

When Languages Collide
• Japanese/English example:
  English writing: KEVIN KNIGHT
  English sounds: K EH V IH N  N AY T
  Japanese sounds: KEBIN NAITO
  Japanese writing: [katakana not preserved]
• V → B: phoneme inventory mismatch
• T → T O: phonotactic constraint
• alphabetic vs. syllabic writing

When Languages Collide
• Common translation problem
  – People and place names
  – New technical terms, borrowings
• Challenging when source and target languages have:
  – different phoneme inventories
  – different phonotactic constraints
  – different writing systems
• English, Japanese, Russian, Chinese, Arabic, Greek
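The two mismatches named above (V → B for the phoneme inventory, T → T O for phonotactics) can be imitated with a few hand-written rules. This is a toy sketch, not the learned model discussed later: the mapping table covers only the phonemes of the slide's example, and the repair rule (o after T or D, u after other stranded consonants, nothing after N) is a simplifying assumption.

```python
# Hypothetical hand-written mapping covering only the phonemes in the example.
SOUND_MAP = {"K": "k", "EH": "e", "V": "b", "IH": "i",
             "N": "n", "AY": "ai", "T": "t"}
VOWEL_PHONEMES = {"EH", "IH", "AY"}


def to_japanese_sounds(phonemes):
    """Map English phonemes to romanized Japanese sounds, one word at a time."""
    out = []
    for i, ph in enumerate(phonemes):
        out.append(SOUND_MAP[ph])        # inventory step, e.g. V becomes b
        if ph in VOWEL_PHONEMES or ph == "N":
            continue                     # vowels and n need no repair
        followed_by_vowel = (i + 1 < len(phonemes)
                             and phonemes[i + 1] in VOWEL_PHONEMES)
        if not followed_by_vowel:        # phonotactic step: add a vowel
            out.append("o" if ph in {"T", "D"} else "u")
    return "".join(out)


print(to_japanese_sounds("K EH V IH N".split()))
print(to_japanese_sounds("N AY T".split()))
```

Real transcription is less regular than these rules suggest, which is the motivation for learning the mapping from data.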

Streets of Tokyo / Katakana

Forward vs Backward Transcription
• Forward transcription
  – Import foreign term / name
  – Newt Gingrich → may be several ways to transcribe into Arabic
  – Generally flexible
• Backward transcription
  – Recover original term / name
  – Usually only one right answer
  – → Newt Gingrich (not Newt Kinkridge)

Japanese
[Figure: Japanese text with katakana names; surviving fragments: …イで予選落ちとなった。 …ーが通算9アンダーで並んだ。 …目を終え、通算2アン…; romanizations: chyado, kenii, iibunpaataigaa]

Chinese/English
  What’s my name in Japanese?
  KEBIN NAITO
  What’s my name in Chinese?
  Great! I would do it like this …
  No! More appealing like this …
  That’s good, but this written character has a more pleasing connotation …
  Hi, what are you guys doing? I brought chips and soda …

Chinese
• Several hundred syllables in inventory
  – Must stick to this idiosyncratic set
  – Washington → Hua Sheng Dun
  – No other syllables are easily written
• Homophony: after we decide on syllables, many characters to choose from
  – Washington → Hua Sheng Dun
• Transcription vs Translation
  – Kevin Knight → Nai Kai Wen or Wu Kai Wen

Translation versus Transcription
• Sometimes things are translated instead of transcribed
  – Japanese: computer → コンピューター (konpyuutaa)
  – Chinese: computer → 电脑 (dian nao) (“electric brain”)
  – Arabic: Southern California → (Janoub Kalyfornya): ½ translated, ½ transliterated

An Interesting Case: What’s Going On Here?
• Observed English/Japanese transcription:
  – Tonya Harding → toonya haadingu
  – Tanya Harding → taanya haadingu
• Perhaps transcription is sensitive to source-language orthography
• Or perhaps the transcriber is mentally mis-pronouncing the source-language word

A Model of Transcription
  English writing: KEVIN KNIGHT
  English sounds: K EH V IH N  N AY T
  Japanese sounds: KEBIN NAITO
  Japanese writing: [katakana not preserved]
Suppose we believe these are the steps. We can model each step with a weighted finite-state transducer (WFST), and employ Claude Shannon’s noisy-channel model.

A Model of Transcription [Knight & Graehl 98]
  WFSA A
    Angela Knight
  WFST B
    AE N J EH L UH N AY T
  WFST C
    a n j i r a n a i t o
  WFST D
    [katakana not preserved]
[Figure: arrows mark the MODELING DIRECTION and the DECODING DIRECTION through the cascade]

A Model of Transcription [Knight & Graehl 98]
  WFSA A: language model
  WFST B: spelling to sound transducer
  WFST C: phonemic transfer transducer
  WFST D: sound to spelling transducer
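The noisy-channel idea behind the cascade above can be shown in miniature: pick the English candidate e that maximizes P(e) · P(j | e) for an observed Japanese string j. The sketch below collapses WFST B, C and D into a single lookup table and uses dictionaries instead of finite-state machines; every probability in it is invented for illustration and none comes from [Knight & Graehl 98].

```python
# All probabilities below are invented for illustration.
LANGUAGE_MODEL = {"knight": 0.6, "night": 0.3, "neat": 0.1}   # P(e)
CHANNEL = {                                                   # P(j | e)
    "knight": {"naito": 0.8, "naitsu": 0.2},
    "night": {"naito": 0.8, "naitsu": 0.2},
    "neat": {"niito": 0.9, "naito": 0.1},
}


def decode(observed):
    """Return the best English candidate and all scores P(e) * P(observed | e)."""
    scores = {e: LANGUAGE_MODEL[e] * CHANNEL[e].get(observed, 0.0)
              for e in LANGUAGE_MODEL}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = decode("naito")
print(best, scores)
```

Here "knight" and "night" sound the same, so the channel cannot separate them; the language model makes the choice. That division of labor is the point of the factorization.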

Spelling to Sound Transducer
• Richard talked about writing systems.
• Such a system captures an infinite relation of sound-sequence, writing-sequence pairs.
[Figure: WFST relating words and sounds (words → sounds, sounds → words); arc labels: ϵ:K, CAT:ϵ, ϵ:S, SCAT:ϵ, ϵ:AE, ϵ:T, ϵ:ϵ]

Learning Sequence Transformation Probabilities
• Ideal training data: [aligned sound pairs, not preserved]
  P(n | M) = 0.5
  P(m u | M) = 0.5
  etc. (need much more data, of course)
• Actual training data: [unaligned sound pairs, not preserved]
• Automatically align string pairs using the unsupervised Expectation-Maximization (EM) algorithm.
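How EM can recover such probabilities from unaligned pairs can be sketched as follows. Note the swap: this is an IBM Model 1 style estimator in which every Japanese sound links to exactly one English sound in its pair. It is not the alignment model of [Knight & Graehl 98], and the training pairs are made up.

```python
from collections import defaultdict


def em_align(pairs, iterations=10):
    """Estimate t[(e, j)] = P(j | e) from unaligned (english, japanese) pairs."""
    j_vocab = {j for _, js in pairs for j in js}
    t = defaultdict(lambda: 1.0 / len(j_vocab))   # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, js in pairs:
            for j in js:
                norm = sum(t[(e, j)] for e in es)
                for e in es:
                    frac = t[(e, j)] / norm       # E-step: soft link between j and e
                    count[(e, j)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts for each English sound
        t = defaultdict(float, {(e, j): c / total[e]
                                for (e, j), c in count.items()})
    return t


# Made-up training pairs (English phonemes, Japanese sounds).
pairs = [
    (["N", "AY", "T"], ["n", "a", "i", "t", "o"]),
    (["N"], ["n"]),
    (["T", "AY"], ["t", "a", "i"]),
]
t = em_align(pairs)
for j in ["n", "a", "i", "t", "o"]:
    print(f"P({j} | N) = {t[('N', j)]:.3f}")
```

The one-sound pair (N, n) anchors the estimate: once P(n | N) rises, the remaining sounds in the longer pair are pushed toward AY and T.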

English–Japanese phonemic transfer patterns learned from parallel sequences
Learned by EM algorithm [Knight & Graehl 98]

  English   Japanese   Probability
  AH        a          0.486
            o          0.169
            e          0.134
            i          0.111
            u          0.076
  L         r          0.621
            r u        0.362

Decoding
  WFSA A
    [millions more candidates]
  WFST B
    AE N J IH R UH N AY T
    AH N J IH L UH N AY T OH
    [millions more candidates]
  WFST C
    a n j i r a n a i t o
  WFST D

A Model of Transcription
  WFSA A
    Angela Knight
  WFST B
    AE N J EH L UH N AY T
  WFST C
    a n j i r a n a i t o
  WFST D
• Can this transformation (WFST C) be learned from non-parallel data?
• I.e., can katakana be deciphered without parallel text?
• We’ll return to this later → Decipherment section

Intermission

Alternative: Mapping Character Sequences Directly
  English writing: KEVIN KNIGHT
  English letter chunks: KE VI N  KN IGH T
  Japanese writing: [katakana not preserved]
• Dispenses with spelling-to-sound models and pronunciation dictionaries
• Can be learned from parallel data using statistical MT-like techniques (over characters instead of words)

Hybrid Mapping Models
• Sound-based and character-based methods can be combined
  – [Al-Onaizan & Knight 02]
  – [Bilac & Tanaka 04, 05]
  – [Oh & Choi 2005, Oh et al 06]

Re-ranking Transcription Candidates [Al-Onaizan & Knight 02]
• Co-reference can help
  – Short name may be disambiguated by full version that appears earlier in a document
• Web counts can help
  – Bell Clinton (6m), Bill Clinton (27m)
• Context can help
  – Donald Martin » Donald Marron but:
  – Donald Martin Lightyear Capital (7)
  – Donald Marron Lightyear Capital (6000)

Use of Transcription in Machine Translation Systems
• What doesn’t work:
  – Execute named-entity (NE) recognition on source text
  – Transcribe recognized items
  – Tell MT system to use transcriptions
• Often breaks a translation that was perfect before!
  – NE recognition is error-ful
  – Transcription is error-ful
  – Not all NEs should be transcribed
  – Phrase disruption
• Vanilla MT system: [f1 f2 f3] → [e1 e2 e3] (whole phrase translation)
• “Improved” MT system: f1 [f2 f3] → e5 [e2 e3] (NE ID + transcription)

Use of Transcription in Machine Translation Systems
Another approach [Hermjakob et al 08]
[Figure: pipeline with the following components]
  – Transliteration model
  – Bilingual training corpus
  – Bilingual corpus, each side with transliterated items identified & marked
  – Source side only, with transliterated items marked (throw away target side)
  – Trained monolingual “transliterate me” tagger (doesn’t just tag names!)
  – Test corpus → tagged test corpus
  – New suggested phrasal translations (not mandatory use)
  – MT system

Other Uses of Transcription Models
• Cross-lingual information retrieval, e.g., [Gao et al 04]
• Recognize transcriptions in comparable corpora, e.g., [Sproat et al 06]
• Regional studies, e.g., [Kuo et al 09]
• Automatic speech recognition
  – Phonemic transfer models might adjust for non-native speakers?
• Normalization of informal Internet Romanization schemes
  – Greek, Arabic, Russian rter.htm
  – Cypriot Greeklish with Instant Messaging shorthand: ego n 3ero re pe8kia. skeftoume skeftoume omos tpt.
  – Normalized for automatic indexing or translation: Εγώ εν ξέρω ρε παιθκιά. σκέφτουμαι σκέφτουμαι όμως τίποτα.
  – see “Greeklish”, Wikipedia

Overview of the Transliteration/Transcription Literature
We have only touched on what is a large literature.
http://www.cs.mu.oz.au/ skarimi/
S. Karimi, F. Scholer, A. Turpin, A Survey on Machine Transliteration Literature, (Submitted Dec 08, Review received 31 Mar 09) Under Revision for ACM Computing Surveys.

Discriminative models
• Often used in judging potential transcription pairs in comparable corpora since here one is merely trying to classify the pair
• We will briefly review two pieces of work:
  – Klementiev & Roth 2006
  – Some results from the 2008 JHU summer wo…
