Unicode Character Encoding of Archived Linguistic Data
Part 1: Tutorial on Key Unicode Concepts
Peter Constable
SIL International
peter_constable@sil.org
www.sil.org
Introduction
Published data archive requirements:
 Documented protocols / encodings for data representation
 Common protocols / encodings
 Durable protocols / encodings
Character encoding:
 Past: multiple encodings, including non-standard
 Current Best Practice: Unicode
©NRSI/SIL
CTC 2000
Introduction
Character encoding:
 Past: multiple encodings, including non-standard
  Meaning documented only by font
 Current Best Practice: Unicode
  Universal character set
  Becoming dominant IT standard
  Widely documented
  Assumed by many other IT standards
   XML
Introduction
Overview:
 Key Unicode concepts to understand
 Specific issues
Key Concepts
Unicode-related concepts to understand:
 Encoding forms
 Character-glyph model & “smart font” rendering
 Abstract character repertoire
 Character properties
 Alternate representations & normalization
 Compatibility characters
 Private-Use characters
Unicode Encoding Forms
The Unicode Standard provides not one encoding form but three.
Multi-tiered character model:
 Universal character repertoire
 Coded character set: Unicode scalar values U+0000 to U+10FFFF
 Encoding form: specific computer representation
  8-bit data type: UTF-8
  16-bit data type: UTF-16
  32-bit data type: UTF-32
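The relationship between one scalar value and its three encoding forms can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the helper name code_units is hypothetical, and the big-endian codecs are chosen simply so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units one scalar value needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }

# The last value is the top of the range U+0000 to U+10FFFF.
for cp in (0x0061, 0x0628, 0x12A2, 0x10FFFF):
    print(f"U+{cp:04X}", code_units(chr(cp)))
```

UTF-32 always uses a single code unit, while UTF-8 and UTF-16 use a variable number depending on the scalar value.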
Unicode Encoding Forms
Examples:
Character  Scalar value  UTF-32      UTF-16  UTF-8
a          U+0061        0x00000061  0x0061  0x61
ب          U+0628        0x00000628  0x0628  0xD8 0xA8
ኢ          U+12A2        0x000012A2  0x12A2  0xE1 0x8A 0xA2
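The example values above can be reproduced with Python's built-in codecs. The following is a minimal sketch: the helper hex_units is a hypothetical name introduced here, and the big-endian codecs are an assumption made so the output matches the slide without a byte order mark.

```python
def hex_units(ch: str, codec: str, width: int) -> str:
    """Encode ch and show it as hex code units of `width` bytes each."""
    data = ch.encode(codec)
    units = [data[i:i + width].hex().upper() for i in range(0, len(data), width)]
    return " ".join("0x" + unit for unit in units)

for ch in ("a", "\u0628", "\u12A2"):
    print(
        f"U+{ord(ch):04X}",
        "| UTF-32:", hex_units(ch, "utf-32-be", 4),
        "| UTF-16:", hex_units(ch, "utf-16-be", 2),
        "| UTF-8:", hex_units(ch, "utf-8", 1),
    )
```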
Characters vs. glyphs
Character: unit of abstract textual information
Glyph: graphic shape used for the presentation of a character
Character: a
Glyphs: a a a a a a (one character drawn with several different shapes)
Characters vs. glyphs
Not generally 1:1
[Figure: examples of characters and glyphs that do not correspond 1:1, including Arabic ه]
Unicode assumes use of a “smart-font” rendering system that:
 Maps abstract characters into glyphs
 Handles selection, positioning of glyphs
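A rendering engine cannot be shown in a few lines, but the one-character, many-glyphs relationship can be hinted at with Python's unicodedata module. The sketch below is an illustration under a stated assumption: it uses the Arabic presentation-form code points U+FE8F to U+FE92 (compatibility characters) only as stand-ins for the four contextual glyph shapes of BEH. In normal Unicode text only U+0628 is stored, and the font system selects the shape.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the single abstract character

# Stand-ins for the isolated, final, initial, and medial glyph shapes.
GLYPH_STAND_INS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

def underlying_characters(forms):
    """Fold each presentation form back to its abstract character (NFKC)."""
    return {unicodedata.normalize("NFKC", form) for form in forms}

for form in GLYPH_STAND_INS:
    print(f"U+{ord(form):04X} {unicodedata.name(form)}")

# All four shapes correspond to the same abstract character.
print(underlying_characters(GLYPH_STAND_INS) == {BEH})
```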
Abstract Character Representation
Unicode character repertoire:
 Abstract units of information for representation of textual data
 Does not directly encode orthographies
c̃ = < U+0063 LATIN SMALL LETTER C, U+0303 COMBINING TILDE >
ch = < U+0063 LATIN SMALL LETTER C, U+0068 LATIN SMALL LETTER H >
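The point that orthographic units are represented as character sequences can be checked with Python's unicodedata module. This is a minimal sketch: the helper describe is a hypothetical name, and the n-with-tilde comparison is added here only for contrast and does not come from the slides.

```python
import unicodedata

def describe(text: str) -> list:
    """List the code point and character name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

c_tilde = "c\u0303"  # c followed by COMBINING TILDE
n_tilde = "n\u0303"  # contrast case: n followed by COMBINING TILDE

print(describe(c_tilde))
print(describe("ch"))

# NFC leaves c + tilde as a two-character sequence (no precomposed form),
# whereas n + tilde composes to the single character U+00F1.
print(describe(unicodedata.normalize("NFC", c_tilde)))
print(describe(unicodedata.normalize("NFC", n_tilde)))
```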
Abstract Character Representation
Unicode allows productive dynamic composition:
< U+0063 LATIN SMALL LETTER C,
  U+0324 COMBINING DIAERESIS BELOW,
  U+032A COMBINING BRIDGE BELOW,
  U+0303 COMBINING TILDE,
  U+0306 COMBINING BREVE,
  U+0301 COMBINING ACUTE ACCENT >
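The sequence above can be assembled programmatically. The sketch below is illustrative only: the helper compose and the constant MARKS are hypothetical names. It builds the base letter plus five combining marks and prints each character's canonical combining class, which for these marks distinguishes the ones attached below the base (220) from the ones attached above it (230).

```python
import unicodedata

# Combining marks in the order given in the sequence above.
MARKS = [0x0324, 0x032A, 0x0303, 0x0306, 0x0301]

def compose(base: str, marks) -> str:
    """Append combining marks to a base character (dynamic composition)."""
    return base + "".join(chr(cp) for cp in marks)

stack = compose("c", MARKS)

for ch in stack:
    ccc = unicodedata.combining(ch)  # 0 for the base letter
    print(f"U+{ord(ch):04X} ccc={ccc:3d} {unicodedata.name(ch)}")

# One displayed unit, but six characters in the encoded text.
print(len(stack))
```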