Compiling a Partition-Based Two-Level Formalism
6 Pages
English

Compiling a Partition-Based Two-Level Formalism

-

Downloading requires you to have access to the YouScribe library
Learn all about the services we offer

Description

a a t a A a a 1 a a College) A B no. a 92313384. a a College) College. a a Compiling Partition-Based Two-Level Formalism Edmund Grimley-Evans* George Anton Kiraz Stephen G. Pulman University of Cambridge University of Cambridge University of Cambridge (St John's (St John's Cointmter Laboratory Computer Laboratory Computer Laboratory Cambridge CB2 3QG, UK Cambridge CB2 3QG, UK Cambridge CB2 3QG, UK and SRI International, Cambridge Edmund. Grimley-Evans@cl. cam. ac. George. Kiraz@cl. cam. ac. uk sgpOcam, sri. com Abstract rages to Koskenniemi's notation. These are de- tailed more fully in (Black et al., 1987, pp. 13-15), This paper describes an algorithm for the and in (Ritchie et al., 1992, pp. 181-9). In brief: compilation of two (or more) level or- (1) Koskennienli rules are not easily interpretable thographic or phonological rule notation (by tile grammarian) locally, for the interpretation into finite state transducers. The no- of 'feasible pairs' depends on other rules in the tation is an alternative to the standard set. (2) There are frequently interactions between one deriving from Koskenniemi's work: rules: whenever the lexieal/surface pair affected it is believed to have some practical de- by rule appears in tile context of another rule scriptive advantages, and is quite widely B, the grammarian must check that its appearance used, but has different interpretation. in rule will not conflict with the requirements of ...

Subjects

Informations

Published by
Reads 17
Language English
is applied ill parallel at each point in the tSupported by Benefactors' Studentship from St input. John's 454 1;o uk S A c A A en x t;he x an(t ln~ a = |;ire [ 1.994). l a ~"r 2 l n - N l M a n = N ... q N M I E [:or'rH U x x x ... l Tl~e l x X (E,~ X U X a l n a A I A SubA,,~ n L = l~n; a [ a X th('. G a X ) C 1 C A s rule, &lid ~ x a.ud : a ... E~ I ( = a r rN+M) × = [-1 surt'~(:e X c X ,1N+M) = = a form) X 1N> I = a S cl x a ... 6 x {(,% x x x × x al-. = SUBA,,~L M L X x ... 9 x a The partition tbrmalism coImists of two types the set of string-tuples representing possible con- of rules (defined in more detail beh)w) which en- tents of the tapes. proI)er subset of regular force optional or obligatory changes. notion n-relations have the property that they are ex- of well-formedness is defined via the notion of pressible as the Cartesian product of regular 'partition' of sequence of lexical/surface corre- languages, we call such spondences, informally, partition is valid anal- lations 'orthogonal'. (W('. present our detinitions ysis if (i) every element of the t)artition is licensed along tire lines of (Kat)lan and Kay, 1994)). by an optional rule, and (ii) no element of the We use two regulm" ot)erators: Intro and Sub. partition violates an obligatory rule. IntrosL denotes the set of strings in into which We have tbund that this formalism has some elements of may be arbitrarily inserted, and practical adwmtages: (1) The rules are relatively denotes the set of strings in ill which independent each other. (2) Their interpreta- substrings that are in /3 may be replaced by tion is more familiar for linguists: each rule copes strings from A. Both operators map regular lan- with single correspondence: in general you don't guages into regular languages, because they can have to worry about all other rules having to t)e t:haract(!rise(1 by regular relations: over tim (:ompatible with it. (3) Multiple character changes phabct E, Intros. (Idz ({el S))*, art permitted (with some restrictions discussed (Id>] (/3 A))*, wtiere IdL L}, below). (4) category or term associated with the identity relation over L. each rule is requi,'e(t to uni(y with the affected There are two kinds of two-level rules. The con- morpheme, allowing for morI)ho-synta(:tic etfects text restriction, or optional, rules, consist of left to be cleanly described. (5) There simple and context centre c, and right context r. Surface etfMent direct interpreter for tt,e rule forrnalism. coercion, or obligatory, rules require the centre to Tile partition formalism has been implemented be split into lexical and surface c, compolmnts. in the European Commission's ALEP system tbr Detinltion 2.1 N:M context restrietion natural language engineering, distributed to over (CR) rule is triple. (/,,c,r) where l,c,r are 30 sites. Descriptions of EU languages arc 'orthogonal' regular relations of the form :: t)eing develot)e(1. version has also im- ... (: Cl ... (:~ ?' ?'1 ... plemented within SI{.I's Core l,anguage Engine (Carl;er, 1995) and has been used to develot) de- Definition 2.2 N:M surface coercion (SC) scriptions of English, French, Spanish, Polish, rule quadruple (/,c/,c~,r) where and Swedish, and Korean morphology. An N-level ex- are 'orthogonal' regular relations of tile form tension of the formalism has also been developed ... l'n, an(t by (Kiraz, 1994; Kiraz, 1996b) arrd used to are 'orthogonal' regular relations restricting only scribe morphology of Syria(: and other Semitic the lexical and surface tapes, respectively, of the languages, arrd by (Bowden Kiraz, 1995) for ... ~N-} and error dete(',tion in noncon(:atenative strings. This ... ).]~ (W+M. [] 1)m.'tition-l)ased two-level formalism is thus seri- We usually use the following notation tbr rules: ous riwll to the standard Koskcnniemi notation. LLC l,I.;x RI,C ~>[¢-[¢> lIowever, until now, the Koskenniemi notation i,SC has had one clear advantage in that it was clear how compile it into transducers, with all the where consequent gains in etliciency and portability and I,LC (lel't l(,xi,:al corlt,,~t) (~,..., with ability construct lexical transducers LEX (lexical lion. II,S(? (right conl;cxt) (rN+l,..., 1. Sm in t)racticc all the left conl;(;xts start Definition of the Formalism with all the right contexts end with L*, 2.1 Formal Definition we omit wril;ing it and assume it by default. The We use tapes, where tim first tapes are operators are: for CII. s, for S(] rules and ]exical and the remaining are surface, -- for coInposite rules. M. In practi(:e, :: We write Ei prot)osed morphologit:at analysis is an for the alphabet of sylnbols used on tape i, and tuI)le of strings, and rules are intert)reted as :: (Er {el) {c}), so that E* is applying section of this analysis in conl;ext: 455 ~o ,~- 1. 4> *{= ?'i I.i "e ~N 7£ de.-. Cl ?" 1.n~ ix '1"~,. be, 1, ix .s' a') tO =- tO be, ot: re= 1~1 H. i rule. V = B b}, * C - a i E + a < E a ] B - aft;re' pro._ = tit(; B ..., X - B = E n • = B A * C19, - (;' P (1, a - a C x A 1 a = (2 B a e, e P E = (b, B E B R~=) A P,. 1 E A r - (R=>, - a B G b c. * v a ~ V ¢ tit(; = ...(.(I, w a a is E v a x c,., u P1 j = + P A A j a R~=) a E a i a a G X : = P,. i E j d E b i c 0 d a (b, j ¢ E c i * A b O b ¢> l}l~,t~ (n-way concatenation of left con- R3: text, centre, aim right context). Formally: Defiinition 2.3 rule c, r) contextually R1 and R2 illustrate the iterative application of allows (1}, Pc, P,.) iff l, and rules on strings: they sanction the lexical-surface [] strings (VBBB,Vbbb), where the second (B,b) pair serves as the centre of the first application Definition 2.4 An SC rule (l, cl, r) coer- of R2 and as the left context of the second ap- cively disallows (Pt, P~,Pr) iff l, r, plication of the same rule. is an cpenthetic P,. Ecl and [] rule which also demonstrates centres of unequal Definition 2.5 N:M two-level grammar is length. (We assume that rule (l, c,r) ERa, we can make this representa- tuitive for the user when insertion rules are used. tion unique by defining canonical way of convert- For example, the rule (E* (g, g), E~, E~ v, E*) ing each such possible centre into same-length ('change to g') would not disallow string-tuple 6'. simple way of doing this is to string-tuple partitioned as g), (e, c), (u, u)... pad with 0s the right making each string as long assmning some CR rule allows (e, e). as the longest string in C: if (Pl, ...,pn), Earlier versions of the partition fbrmalism could (>0", ...,p,,0*) z*(0, 0) (1) not (in practice) cope with multiple lexical char- actors in SC rules, see (Carter, 1995, §4.1). This However, since we know set of possible is not case here. titions it is U{c ~l,r(l,(-,'r} 1{:,}- we can The tbllowing rules illustrate the formalism: reduce the number of elements of in use, and hence silnplify the calculations, by inserting the 0s RI: in more flexible manIter: e.g., if -- (ab, let (ab, rather than (? (ab, b0): assuming another requires us to use anyway, we R2: only haw; to add (a, 0) rather than (a, and 0). 456 b} b} => Ob) => >~* at, 'u w~ co ...P~: 1; t;o ]~,<=), (1~_~, c~. P,~ R,a 1} P~ P~ {co}, 0 d 3 rC*Tre* S c a - e c o D U a " V " 17,¢_): > a o [o [_J T = , & a a ~ ,c., a ( d - : (l°co(cp, = o a U 1 j x = t() c a 0 × co, U 0 c - .° (/t/co(c'/' G e a = O /~,<:, r I a C in c = 6 c Cs, d a ~ > I (l,~,r)C51i, u D (/~,=>, set; a C b}* D x 6 x E x ~ × a ] The 1)reprocessor could use simI)le heuristics to If we subtract term like this tbr each make such decisions. In any case, the padding of Dorn our initial approximation (eq. 4), then we t)ossibl(, t)artitions (:arries over to the (:entres of have all ext)ression for tile set of strings allowed by the CR rules of tile gralnlnar: ca r,les: (l,,-,,-) {0 c}. Itencetbrth let be l;he of elements of that appear in seine 0-padded rule centre. The contexts of all rules and the lexical and surface Celltres of SC rules Inust l)e converted into same-length regular n-relations |)y inserting 0s at a]l 1)ossible positions on each tape independently: if a; 2; ... xnt ]t remains to enforce the sm'fime coercion rules °= (Introto}xt ... Intro{o}xn)Fire* (2) /~-. For given SC ruh; ct, r) first Note the difference between this insertion of al)llroxinmtion tim set of strings in which this everywhere, denoted °, and the canonical rule is violated is given by: padding Both r('quire the 'orthogonality' condi- Intro{w} 0) tion in ord('r for the intersection with to yield Here c~) is the set of strings that match regular language: inserting into (a, at the lexical centre but, do match the surface all possibh; l)ositions on each tape iIulependently centre. For part (2) of Definition 2.6 to apply this would give non-regular relation, for examt)le. must equal the concatemttion of or more adja- Now we derive formula ibr the set of O-padded cent partitions, hence it has on each side of the ;rod partitioned analysis st;rings accepted by the partition separator and the operator Intro iil- grammar The set of O-pa(l(ted centres troduces additional partition separators into tile of context; restriction rules ix given by: contexts and the centre. The only case not yet c,,,..(L,c,,.) (s) {:overed is where dm centre matches a([jacent th,re we assume that these centres are disjoint partitions (i in part (2) of Definition 2.6). (Vc, .l).c f] 0), because in prac- This can be dealt with by prefixing witll the sub- tice each singleton set,, however tiler(; is an stitution operator so the set of strings in alternative deriw~tion that does not require this. which one of the SC rules is violated is: We proceed subtraetively, starting as an initial Sub~o,~o~o c(:)cor °) approximation with an art)itrary concatenation of )(. the possible l)artitions, i.e. the (:entres of Cl/, rules: co(Dee)* (4) We subtract; this too fl'om our aporoxin, (eq. 8) in order to arrive at formula for the set From this we wish to subtract tim set of strings of 0-padded and partitione(l strings that are ac- containing t)artition that is not allowed by any CR rule: We introduce new placeholder symbol (:epted Iiy the grammar: T, to represent the centre of rule, co(Dw)*- Sub~<~ so the set of possihle contexts for given centre ~GI) is given by: [.J z%, (s) (/,~,r)~ll (/,d,r)6 IL So the set, of contexts in wlfich the centre may Subw,ww not, al)t)ear is the comlflement of this: I('T"'° Intro{~} (/°co(el '-- '°) (1]) (t,,Lr)<1~ Now we can introduce t;he partition sel)arator Finally, we can replace l;he partition separator throughout, then substitute the centre itself, w&o, anti the st)ace sylnbol 0/)y to convert So into for its placeholder in order derive an expres- regular (but no longer same-length) relation sion for the set of partitioned strings in which an that maps t)etween lexical and surface representa- instan('e of the centre al)l)ears ill context in tions, as in (Kaphm and Kay, 1994, p. 368). which is 'not allowed: denotes comt)osition Algorithm and Illustration This section goes through the compilation of the samI)le grammar in section 2.1 step by step. (7) 457 it, co co c[~)co, (6) -- 7c a~l(ur{ lt<~ ,.,. I,,~, Intro{~} Sub~o,o0w, it, 11ol, Os (r,'(~) 7r* (.9) Cs°)cor -- L'. I;o a, (1, 7c it" 1), {(V,V>, rlr* (3) A D lk a : 2 expression ~ B:b (1) a 10). a 11), D 4 D 371). = = 1This x in: (13) is: p. (2) = - Intro{o}(E~) (1) x (15) a N - = 1994) 1 - k, for each context pair rk of specific centre which take the place of the contexts. Then they ensure that each k:>k only oc- Phase deals with CR rules. We have two cen- curs if followed by its context rk. The automaton tres to process: (B,b) (R1 R2) and (0,b) (R3). which results after compiling the two rules For each centre, we compute the set of invalid con- V:V V:V texts in which the centre occurs (eq. 7). Then we subtract this from FSA1 (eq. 8), yielding FSA2. V:V "" V:V >1:>1 FSA2 Removing symbols The third phase deals with SC rules: here the portion of R3. Firstly, we compute the set of strings in which R3 is violated (eq. Secondly, we subtract the result from FSA2 (eq. re- B:b V:V sulting in an automaton which only differs from FSA2 in that the edge from to is deleted. Our algorithm produces this machine directly. Compiling Koskenniemi's formalism is compli- Comparison with Previous cated by its interpretation: rules apply to the en- Compilations tire input. partition rule is concerned only with the part of the input that matches its centre. This section points out the differences in compil- ing two-level rules in Koskenniemi's formalism on an expansion of Restrict in (Kaplan and Kay, one hand, and the one presented here on the other. 458 qo q5 results auxiliary all .'2:>2 ~:<, >~:>~ >~:>~ >k:>k Ik ~c*c c~r* 7r*l O's E~. E1 (1994). 0 (1995). x (1992). 1 202-9. a a b Resem'ch ) 411. x Linguis- 11 D. 5 S., COLING-90, Approach (1990). a level Rule (1989). Beesley, Mass. COLING-9~, English 20(3):331 Mechanisms e Practical ical Morphology: phographe.mic Computational F()rmalisms (1992). a I{ussell, a A., Computational and (1996b). Speech (1994). Computer Compiler. a Two-Level feat{{re-based a A A (1993). 406 M. ers. Study, L. Design tics, Machine Computational Virtual (1994). and EACL-95, Formalism (1995). Rule ACL-95, al, A level EACL-87, (1991). (1987). Helsinki. A., Two-Level a (1983). a Linear a Non- L. small-scale prototype of the algorithm has 4.2 Conditional Compilation been implemented in Prolog. The rule compiler Compiling epenthetic rules in the Koskenniemi mak(;s use of finite-state calculus library which formalism requires special means; hence, the algo- allows the user to compile regular expressions into rithm is conditional on the type of tim rule (Ka- automata. The regular expression language in- plan and Kay, 1994, p. 374). This peculiarity, in cludes standard operators in addition to the op- the Koskenniemi formalism, is due to the dual in- erators defined here. The system has been tested terpretation of the symbol in the parallel formal- with number of hypothetical rule sets (to test isin: it is genuine symbol in the alphabet, yet it the integrity of the algorithm) and linguistically acts as the empty string in two-level ext)ressions. motivated morphological grammars which make Note that it is the duty of the user to insert such use of multiple tapes. Compiling realistic descrip- symbols as appropriate (Karttunen and Beesley, tions would need more efficient implementation 1992). in more suitable language such as C/C++. This duality does not hohl in the pm'tition ]Alture work includes an extension to simulate formalism. The user can express lexical-surface restricted torm of unification between categories pairs of unequal lengths. It is the duty of the rule associated with rules and morphemes. compiler to ensure that all expressions m'e of equal length prior to compilation. With CR rules, this References is done by padding zeros. With SC rules, howew;r, the Intro operator accomplishes this task. There Black, Ritchie, G., Pulman, and Russell, G. for morphographemic descrip- is subtle, but important, (lifl~rence here. tion. In pp. Consider rule R3 (§2.1). The 0-padded centre Bowden, T. and Kiraz, G. mor- of the CR portion becomes (0,b). The SC portion, model for error correction in noncon- however, is computed by the expression catenative strings. In pp. 24-30. Insert{0}(() Insert{0l(b) (16) Carter, Rapid development of morpholog- descriptions for full language processing sys- yielding automaton (a): tems. in pp. Kaplan, R. and Kay, Regular models of phonological rule systems. Karttunen, Constructing lexical transduc- In pp. Karttunen, and K. Palo Alto Center, Xerox Cot- poration. Kiraz, G. Multi-tape two-level morphology: case study in Semitic non-linear morphology. In <0,0> pp. Kiraz, G. If the centre of the SC portion had been padded PhD thesis, University of Cam- bridge. with 0's, the centre wouht have been Koskenniemi, K. PhD Insert{0}(0 Insert{0}(/,) (17) thesis, University of yielding the undesired automaton (b). Both are Puhnan, Two morphology. In Alshawi similar except that state is final in the former. et. ET6/I Taking (a) as the centre, eq. 10 includes (cd,cd); chapter CEC, Luxembourg. hence, eq. excludes it. The compilation of our Pulman, and Hepple, formalism for two-level phonology: description rules is not conditional; it is general enough to and implementation. Lan- cope with all sorts of rules, epenthetic or not. 7:333 Ritchic, G., Black, G., and Puhnan, Conclusion and Future Work for Lexicon. MIT Press, This paper showed how to compile the partition Cambridge formalism into N-tape automata. Apart from in- Ruessink, H. Two formalisms. Technical creased efficiency and portability of impleinenta- Report Utrecht Working Papers in NLP. lions, this result also enables us to more easily Trost, H. The application of two-level mor- relate this formalism to others in the field, using phology to non-concatenative German morphology. the finite-state calculus to describe the relations In Karlgren, H., editor, pages 371-6. implemented by the rule compiler. 459 5, the S. 58. guage, S. 5. qo S. Morphology. Morphology. to 18{}-6. COLING-9]~, A~,, Any 78. M. 8. 11