grasp.tutorial.nb
Grasp Corpus tutorial
A cool introduction to the Grasp Corpus
Jean-Paul Sansonnet, July 2003
1. What is the Grasp Corpus?

The Grasp Corpus was built to support the Grasp Semantic Analyzer [currently under construction in our team], but it can be used for other purposes. The aim of the Grasp Corpus is to provide full grammatical coverage of English rather than full lexical coverage. Hence, this corpus was not collected from natural data but is composed of "constructed sentences", that is, artificial sentences like those one can find in grammar books. However, these sentences are quite representative of the utterances one can encounter in human/human dialogues.

Consequently, the Grasp Corpus cannot be considered typical in terms of:
- lexical data: the statistical distribution of the words in the Grasp Corpus is not that of natural communicative interaction,
- semantic data: the statistical distribution of the actions in the Grasp Corpus is not that of natural communicative interaction.

On the contrary, the Grasp Corpus has been carefully constructed to cover the main grammatical rules of the English language:
- 136 main units (or main grammatical rules) are considered,
- for each main rule, an average of 5 subcases is provided, which amounts to about 700 grammar rules,
- for each main rule, an average of 50 sentences featuring the rule is provided, which amounts to more than 7000 sentences in the corpus.

Indeed, as a sentence can feature several rules, a given grammatical rule is represented by many more than 50 sentences; e.g. the negation rule is featured in more than 1400 sentences of the Grasp Corpus.

The Grasp Corpus is based on the well-known grammar book "English Grammar in Use" by Raymond Murphy, Cambridge University Press, from which the grammatical categories and the sentences are drawn:
- the main rules of the Grasp Corpus are based on the 136 grammatical Units of Murphy's book,
- the sentences of the Grasp Corpus are taken from the examples provided in Murphy's book.

This is the reason why the rights of the Grasp Corpus are reserved (please mail jps@limsi.fr for more information).
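The size figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are hypothetical, and the value 7557 is the GClength reported in section 4 of this tutorial.

```mathematica
units = 136;                     (* number of grammatical units *)
rules = units*5                  (* 680 subcases, quoted as "about 700" *)
planned = units*50               (* 6800 sentences at 50 per unit *)
actualAverage = N[7557/units]    (* about 55.6 sentences per unit *)
```

The actual average is therefore slightly above the nominal 50 sentences per unit.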
2. Basic access to the Grasp Corpus
2.1. Load from the grasp.corpus.mx file

The Grasp Corpus is not stored as a simple list of textual sentences because, for each sentence, it provides:
- its full string form,
- its tagged form: the part-of-speech tagging is achieved by TreeTagger,
- the grammatical category that the sentence is supposed to illustrate.

The utility functions of the Grasp Corpus are stored in a Mathematica .m file and are to be loaded first. The data of the Grasp Corpus are stored in a compact Mathematica file (.mx) so that they can be loaded into RAM in a very short time (less than a second) as three predefined Mathematica functions (str, tag, unit). This file is named "grasp.corpus.mx" and must be in the appropriate folder. So we have to execute something of the following form:

    GRASPCORPUSPATH = "D:\\Mes documents\\WORK\\GRASP\\";
    Get[GRASPCORPUSPATH <> "grasp.corpus.m"]    (* loading of the utility functions *)
    Get[GRASPCORPUSPATH <> "grasp.corpus.mx"]   (* loading of the data functions *)

2.2. Quick reference

When loaded, the Grasp Corpus exports global variables and global functions of the form GCxxxx, where "GC" stands for "Grasp Corpus". We give here a very short list of those resources, which are fully explained below.

Grasp global variables

    GClength                  total number of sentences in the corpus
    GCcharlength              total number of characters of the sentences of the corpus
    GCmaxunit                 total number of grammatical units
    GCglobalunitdefinitions   list of the textual descriptions of the main grammatical chapters
    GCunitdefinitions         list of the textual descriptions of the grammatical units
    GCwords                   list of the words occurring in the corpus (not repeated and in alphabetic order)
    GClemma                   list of the lemmas occurring in the corpus (not repeated and in alphabetic order)
    GCcorpustags              list of the Penn Treebank tags occurring in the corpus (not repeated and in alphabetic order)
    GCgramtags                list of the tags for the gramwords cluster
    GCgramwords               list of the words in the corpus which are considered to be "grammatical" (not repeated and in alphabetic order)

and respectively for the following clusters:

    adverbs:      GCadverbtags, GCadverbwords
    adjectives:   GCadjectivetags, GCadjectivewords
    verbs:        GCverbtags, GCverbwords
    nouns:        GCnountags, GCnounwords
Grasp functions

    GCstring[i]             returns the string of the i-th sentence of the corpus
    GCtag[i]                returns the tagged form of the i-th sentence of the corpus
    GCunit[i]               returns the number of the unit grammatically covering the i-th sentence of the corpus
    GCunitdef[k]            returns the textual description of the k-th unit
    GCwordcount[s]          returns the number of occurrences of word s
    GClemmacount[s]         returns the number of occurrences of lemma s
    GCsee[i | {i,..}]       displays the information about the i-th sentence or a list of sentences
    GCseeunit[k]            displays the information of all the sentences covered by grammatical unit k
    GCtaglist[s | {s,..}]   calls TreeTagger directly on sentences that are not in the corpus

where:
- i is an integer such that 1 ≤ i ≤ GClength,
- k is an integer such that 1 ≤ k ≤ GCmaxunit,
- s is a string.
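The loading step of section 2.1 can be made slightly more defensive. The following is a minimal sketch, assuming a reasonably recent Mathematica version (FileNameJoin and FileExistsQ are not available in very old releases). The folder name and the variables graspDir and files are hypothetical; only the two file names come from this tutorial.

```mathematica
(* Hypothetical folder: replace it with the location of the two corpus files *)
graspDir = "D:\\Mes documents\\WORK\\GRASP";

(* Full paths, in loading order: utility functions first, then the data *)
files = FileNameJoin[{graspDir, #}] & /@ {"grasp.corpus.m", "grasp.corpus.mx"};

(* Load both files only if both are present, otherwise report the problem *)
If[And @@ (FileExistsQ /@ files),
  Scan[Get, files],
  Print["Grasp Corpus files not found in ", graspDir]
]
```

Checking for the files first gives a readable message instead of a Get failure when the path is wrong.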
3. Basic access functions
3.1. String form of the sentences

To obtain the string of the first sentence of the corpus, just type

    GCstring[1]
    Ann is in her car

This is a list of the first ten sentences of the corpus:

    GCstring /@ Range[10]
    {Ann is in her car, she is on her way to work, she is driving to work, this means that she is driving now, the action is not finished, I am driving, please don't make so much noise, where's Margaret?, she is having a bath, let's go out now}

Note: by Mathematica convention, strings are printed without the surrounding "" although the quotes are part of their internal form (as shown with the FullForm function).

    FullForm[GCstring[1]]
    "Ann is in her car"

3.2. Tagged form of the sentences

The GCtag function

For each sentence, one can obtain its tagged form by typing

    GCtag[1]
    {B[1, , ], NP[2, Ann, Ann], VBZ[3, is, be], IN[4, in, in], PP$[5, her, her], NN[6, car, car], B[7, , ]}

To provide the part-of-speech tagging we used, with permission, TreeTagger from Helmut Schmid [http://www.ims.uni-stuttgart.de/~schmid/], which uses the Penn Treebank Tag Set tagging conventions.
- More information on TreeTagger can be obtained at: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
- More information on the Penn Treebank can be obtained at: http://www.cis.upenn.edu/~treebank/home.html

For each word or punctuation sign of each sentence, the tagging is of the following form:

    TAG[n, WORD, LEMMA]
where:
- TAG is the Penn Treebank tag associated with the word or sign,
- n is the integer position of the word in the sentence,
- WORD is the string of the original word or sign in the sentence,
- LEMMA is the string of the lemmatized form of WORD.

In some cases, WORD and LEMMA can be empty strings "". Moreover, each sentence is bordered at the beginning and at the end by a "border word" tagged as B[n, , ], where the blanks are the display of empty strings "".

The Penn Treebank tags actually used in the corpus are given by the global variable GCcorpustags:

    GCcorpustags
    Length[%]
    {B, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, MD, NN, NNS, NP, NPS, PDT, POSS, PP, PP$, QUEST, RB, RBR, RBS, RP, SENT, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, VIRG, WDT, WP, WP$, WRB}
    38

Direct call to TreeTagger

One can call TreeTagger directly on one's own sentences by using the function GCtaglist:
- GCtaglist[s] tags a single sentence,
- GCtaglist[{s1,s2,..}] tags a list of sentences.

    GCtaglist["Diamonds are a girl best friends"]
    {{B[1, , ], NNS[2, Diamonds, diamond], VBP[3, are, be], DT[4, a, a], NN[5, girl, girl], JJS[6, best, good], NNS[7, friends, friend], B[8, , ]}}

Note: beware that the functions GCstring, GCtag etc. are not filled for the sentences tagged by GCtaglist; moreover, the sentences are not appended to the corpus.

The Penn Treebank Tag Set

Note: this is extracted from the Penn Treebank page on the Internet: http://www.cis.upenn.edu/~treebank/home.html

The tagset used in tagging the demo corpus available here is the Penn Treebank Tag set, described for example in Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank, in Computational Linguistics, Volume 19, Number 2 (June 1993), pp. 313--330 (Special Issue on Using Large Corpora). The tagging was done at UPenn. The following part-of-speech tags are used in the corpus:

    1.  CC    Coordinating conjunction
    2.  CD    Cardinal number
    3.  DT    Determiner
    4.  EX    Existential there
    5.  FW    Foreign word
    6.  IN    Preposition or subordinating conjunction
    7.  JJ    Adjective
    8.  JJR   Adjective, comparative
    9.  JJS   Adjective, superlative
    10. LS    List item marker
    11. MD    Modal
    12. NN    Noun, singular or mass
    13. NNS   Noun, plural
    14. NP    Proper noun, singular
    15. NPS   Proper noun, plural
    16. PDT   Predeterminer
    17. POS   Possessive ending
    18. PP    Personal pronoun
    19. PP$   Possessive pronoun
    20. RB    Adverb
    21. RBR   Adverb, comparative
    22. RBS   Adverb, superlative
    23. RP    Particle
    24. SYM   Symbol
    25. TO    to
    26. UH    Interjection
    27. VB    Verb, base form
    28. VBD   Verb, past tense
    29. VBG   Verb, gerund or present participle
    30. VBN   Verb, past participle
    31. VBP   Verb, non-3rd person singular present
    32. VBZ   Verb, 3rd person singular present
    33. WDT   Wh-determiner
    34. WP    Wh-pronoun
    35. WP$   Possessive wh-pronoun
    36. WRB   Wh-adverb

IMS Stuttgart / WWW@IMS.Uni-Stuttgart.DE / Tue May 19 18:04:13 1998 (hofmanaa)

So we have the 36 Penn tags:

    penntags = {CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNS, NP, NPS, PDT, POS, PP, PP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB}
    {CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNS, NP, NPS, PDT, POS, PP, PP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB}

The difference between the Penn Treebank Tag Set and the tagset used in the Grasp Corpus is given by the two following sets.

The tags belonging to the Penn Treebank Tag Set but not used in the Grasp Corpus:

    Complement[penntags, GCcorpustags]
    {LS, POS, SYM}

that is:
- LS: list item marker (there is no list in the Grasp Corpus),
- POS: possessive ending (it is preprocessed by Grasp),
- SYM: special symbol (they are preprocessed by Grasp).

The tags specifically added by the Grasp Corpus:

    Complement[GCcorpustags, penntags]
    {B, POSS, QUEST, SENT, VIRG}

that is:
- B: border of sentences,
- POSS: mark of the possessive form,
- QUEST: mark of interrogation with a "?",
- SENT: end of sentence or ";",
- VIRG: "," occurring within a sentence.

3.3. Grammatical units

The GCunit function links a sentence to exactly one grammatical category (or unit) in the Murphy ontology. Below, one can see the units associated with the first 50 sentences of the corpus:

    GCunit /@ Range[50]
    {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}

The number of grammatical units is given by the global variable GCmaxunit:

    GCmaxunit
    136

The GCunitdef function gives the textual grammatical description of unit i, 1 ≤ i ≤ GCmaxunit. Here are the descriptions associated with units 1 and 2:

    GCunitdef[1]
    GCunitdef[2]
    Present and past -- Present continuous (I am doing)
    Present and past -- Present simple (I do)

The first part of the description (here "Present and past --") gives the general grammatical chapter (there are 15 general chapters).

The description of the grammatical unit covering a sentence can be shown by combining the GCunit and GCunitdef functions. For example, to show the description of the first sentence of the corpus one can type

    GCunitdef[GCunit[1]]
    Present and past -- Present continuous (I am doing)
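The unit numbers and descriptions shown above are ordinary lists and strings, so they can be processed with standard functions. The sketch below is self-contained: it rebuilds the GCunit output for the first 50 sentences by hand and splits one description into chapter and title. The variable names are hypothetical, and the " -- " separator is an assumption based on the printed form of the descriptions; with the corpus loaded, the same Tally call could be applied to GCunit /@ Range[GClength].

```mathematica
(* Units of the first 50 sentences, copied from the GCunit output:
   sentences 1 to 29 belong to unit 1, sentences 30 to 50 to unit 2 *)
units50 = Join[ConstantArray[1, 29], ConstantArray[2, 21]];

(* Number of sentences per unit, as {unit, count} pairs *)
perUnit = Tally[units50]

(* Split one description into its chapter and its unit title.
   The " -- " separator is assumed from the printed form. *)
desc = "Present and past -- Present continuous (I am doing)";
{chapter, title} = StringSplit[desc, " -- "]
```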
Note: a given sentence actually features several grammatical topics, but here we can only show the predefined one.

The 15 general grammatical chapters are given by the global variable GCglobalunitdefinitions:

    Print[TableForm[GCglobalunitdefinitions]]
    Present and past
    Present perfect and past
    Future
    Modals
    Conditionals and wish
    Passive
    Reported speech
    Questions and auxiliary words
    ing and the infinitive
    Articles and nouns
    Pronouns and determiners
    Relatives clauses
    Adjectives and adverbs
    Conjunctions and prepositions
    Prepositions

The descriptions of all the units are given by the global variable GCunitdefinitions:

    Print[TableForm[GCunitdefinitions]]
    Present and past -- Present continuous (I am doing)
    Present and past -- Present simple (I do)
    Present and past -- Present continuous and present simple (1)
    Present and past -- Present continuous and present simple (2)
    Present and past -- Past simple (I did)
    Present and past -- Past continuous (I was doing)
    Present perfect and past -- Present perfect (I have done) (1)
    Present perfect and past -- Present perfect (I have done) (2)
    Present perfect and past -- Present perfect continuous (I have been doing)
    Present perfect and past -- Present perfect continuous and simple
    Present perfect and past -- How long have you (been) ...?
    Present perfect and past -- When...? and How long...? For and Since
    Present perfect and past -- Present perfect and past (I have done + I did) (1)
    Present perfect and past -- Present perfect and past (I have done + I did) (2)
    Present perfect and past -- Past perfect (I had done)
    Present perfect and past -- Past perfect continuous (I had been doing)
    Present perfect and past -- Have and Have got
    Present perfect and past -- Used to (do)
    Future -- Present tenses (I am doing / I do) for the future
    Future -- I'm going to (do)
    Future -- Will / Shall (1)
    Future -- Will / Shall (2)
    Future -- I will and I'm going to
    Future -- Will be doing and will have done
    Future -- When I do / When I've done ; When and if
    Modals -- Can, could and (be) able to
    Modals -- Could (do) and could have (done)
    Modals -- Must and can't
    Modals -- May and might (1)
    Modals -- May and might (2)
    Modals -- Must and have to
    Modals -- Must mustn't needn't
    Modals -- Should (1)
    Modals -- Should (2)
    Modals -- Had better It's time...
    Modals -- Cn / Could / Would you...? etc.
    Conditionals and wish -- IF I do... and If I did...
    Conditionals and wish -- If I knew... I wish I knew...
    Conditionals and wish -- If I had known... I wish I had known...
    Conditionals and wish -- Would I wish...would
    Passive -- Passive (1) is done / was done
    Passive -- Passive (2) be / been / being done
    Passive -- Paqssive (3)
    Passive -- It is said that... He is said to... (be) supposed to...
    Passive -- Have something done
    Reported speech -- Reported speech (1) he said that...
    Reported speech -- Reported speech (2)
    Questions and auxiliary words -- Questions (1)
    Questions and auxiliary words -- Questions (2) Do you know where...? / She asked me where...
    Questions and auxiliary words -- Auxiliary verbs (have / do / can) I think so / I hope so etc.
    Questions and auxiliary words -- Question tags (do you? isn't it? etc.)
    ing and the infinitive -- Verb + -ing (enjoy doing / stop ding etc.)
    ing and the infinitive -- Verb + to... (decide to do / forget to do etc.)
    ing and the infinitive -- Verb + (object) + to... (I want (you) to do etc.)
    ing and the infinitive -- Verb + -ing or to... (1) remember / regret etc.)
    ing and the infinitive -- Verb + -ing or to... (2) (try / need / help)
    ing and the infinitive -- Verb + -ing or to... (3) (like / would like etc.)
    ing and the infinitive -- Prefer and would rather
    ing and the infinitive -- Preposition (in / for / about etc.) + -ing
    ing and the infinitive -- Be / get used to something (I'm used to...)
    ing and the infinitive -- Verb + preposition + -ing (succeed in ing / accuse somebody of ing etc.)
    ing and the infinitive -- Expressions + -ing
    ing and the infinitive -- To..., For... and so that... (purpose)
    ing and the infinitive -- Adjective + to...
    ing and the infinitive -- To... (afraid to do) and preposition + -ing (afraid of ing)
    ing and the infinitive -- See somebody do and see somebody doing
    ing and the infinitive -- -ing clauses (Feeling tired, I went to ed early)
    Articles and nouns -- Countable and uncountable nouns (1)
    Articles and nouns -- Countable and uncountable nouns (2)
    Articles and nouns -- Countable nouns with a / an and some
    Articles and nouns -- A / an and the
    Articles and nouns -- The (1)
    Articles and nouns -- The (2) (School / the school)
    Articles and nouns -- The (3) (Children / the children)
    Articles and nouns -- The (4) (The giraffe / the telephone / the piano etc. ; the + adjective)
    Articles and nouns -- Names with and without the (1)
    Articles and nouns -- Names with and without the (2)
    Articles and nouns -- Singlar and plural
    Articles and nouns -- Noun + noun (a tennis ball / a headache etc.
    Articles and nouns -- -'s (the girl's name) and of... (the name of the book)
    Pronouns and determiners -- A friend of mine ; My own house ; On my own / by myself
    Pronouns and determiners -- Myself / yourself / themselves etc.
    Pronouns and determiners -- There... and it...
    Pronouns and determiners -- Some and any
    Pronouns and determiners -- No / none / any
    Pronouns and determiners -- Much, many, little, few, a lot, plenty
    All / all of ; most / most of ; no / none of etc.
    Pronouns and determiners -- Both / both of ; neither / neither of ; either / either of
    Pronouns and determiners -- All, every and whole
    Pronouns and determiners -- Each and every
    Relatives clauses -- clauses with who / that / which
    Relatives clauses -- clauses with ort without who / that / which
    Relatives clauses -- whose / whom / where
    Relatives clauses -- extra information clauses (1)
    Relatives clauses -- extra information clauses (2)
    Relatives clauses -- -ing and ed clauses (the women talking to Tom, the boy injured in the accident)
    Adjectives and adverbs -- Adjectives ending in ing and ed (boring / bored etc.
    Adjectives and adverbs -- Adjectives: word order (a nice new house) adjectives after verbs (You look tired)
    Adjectives and adverbs -- Adjectives and adverbs (1) quick / quickly
    Adjectives and adverbs -- Adjectives and adverbs (2) well / fast / late ; hard / hardly
    Adjectives and adverbs -- So ans such
    Adjectives and adverbs -- Enough and too
    Adjectives and adverbs -- Quite and rather
    Adjectives and adverbs -- Comparison (1) cheaper ; more expensive etc.
    Adjectives and adverbs -- Comparison (2)
    Adjectives and adverbs -- Comparison (3) as...as / than
    Adjectives and adverbs -- Superlatives the longest / the most enjoyable etc.
    Adjectives and adverbs -- Word order (1) verb + object ; place and time
    Adjectives and adverbs -- Word order (2) adverbs with the verb
    Adjectives and adverbs -- Still, yet and already ; Any more / any longer / no longer
    Adjectives and adverbs -- Even
    Conjuctions and prepositions -- Although / though / even though ; In spite of / despite
    Conjuctions and prepositions -- In case
    Conjuctions and prepositions -- Unless ; As long as and provided / providing
    Conjuctions and prepositions -- As (reason and time)
    Conjuctions and prepositions -- Like and as
    Conjuctions and prepositions -- As if
    Conjuctions and prepositions -- For during and while
    Conjuctions and prepositions -- By and until ; By the time
    Prepositions -- At / on / in (time)
    Prepositions -- On time / in time ; At the end / on the end
    Prepositions -- In / at / on (place) (1)
    Prepositions -- In / at / on (place) (2)
    Prepositions -- In / at / on (place) (3)
    Prepositions -- To / at / in / into
    Prepositions -- On / in / at (other uses)
    Prepositions -- By
    Prepositions -- Noun + preposition (reason for, cause of etc.)
    Prepositions -- Adjective + preposition (1)
    Prepositions -- Adjective + preposition (2)
    Prepositions -- Verb + preposition (1) at and to
    Prepositions -- Verb + preposition (2) about / for / of / after
    Prepositions -- Verb + preposition (3) about and of
    Prepositions -- Verb + preposition (4) of / for / from / on
    Prepositions -- Verb + preposition (5) in / into / with / to / on
    Prepositions -- Phrasal verbs (get up / brak down / fill in etc.)

3.4. The GCsee and GCseeunit functions

One can display all the information associated with a given sentence i with the utility function GCsee:
- GCsee[i] shows sentence i,
- GCsee[{list of sentence numbers}] shows all the listed sentences.

For example, to display the first ten sentences of the corpus:

    GCsee[Range[10]]
    1   1  {B, NP, VBZ, IN, PP$, NN, B}                  Ann is in her car
    2   1  {B, PP, VBZ, IN, PP$, NN, TO, VB, B}          she is on her way to work
    3   1  {B, PP, VBZ, VBG, TO, VB, B}                  she is driving to work
    4   1  {B, DT, VBZ, IN, PP, VBZ, VBG, RB, B}         this means that she is driving now
    5   1  {B, DT, NN, VBZ, RB, VBD, B}                  the action is not finished
    6   1  {B, PP, VBP, VBG, B}                          I am driving
    7   1  {B, VB, VBP, RB, VB, RB, JJ, NN, B}           please don't make so much noise
    8   1  {B, WRB, VBZ, NP, QUEST, B}                   where's Margaret?
    9   1  {B, PP, VBZ, VBG, DT, NN, B}                  she is having a bath
    10  1  {B, VBD, PP, VBP, RB, RB, B}                  let's go out now

where:
- the first column is the position of the sentence in the corpus,
- the second column is the grammatical unit associated with the sentence,
- the third column is the tagged form of the sentence (abridged to the tags),
- the last column is the string form of the sentence.

Note: with the option NOTAG in GCsee, the tagged form is removed from the display. The same example without the tags:

    GCsee[Range[10], NOTAG]
    1   1  Ann is in her car
    2   1  she is on her way to work
    3   1  she is driving to work
    4   1  this means that she is driving now
    5   1  the action is not finished
    6   1  I am driving
    7   1  please don't make so much noise
    8   1  where's Margaret?
    9   1  she is having a bath
    10  1  let's go out now

One can see all the sentences covering a given grammatical unit by using the GCseeunit function: GCseeunit[k] shows all the sentences covering the grammatical unit k, 1 ≤ k ≤ GCmaxunit.
For example, all the sentences covering the second unit are displayed by

    GCseeunit[2]
    Present and past -- Present simple (I do)
    30  2  {B, NP, VBZ, DT, NN, NN, VIRG, CC, RB, PP, VBZ, IN, NN, JJ, B}   Alex is a bus driver, but now he is in bed asl
    31  2  {B, PP, VBZ, RB, VBG, DT, NN, B}                                he is not driving a bus
    32  2  {B, NNS, VBP, IN, NNS, IN, NNS, B}                              nurses look after patients in hospitals
    33  2  {B, PP, RB, VBP, RB, IN, NNS, B}                                I usually go away at weekends
    34  2  {B, DT, NN, VBZ, IN, DT, NN, B}                                 the earth goes around the sun
    35  2  {B, PP, VBP, IN, NP, B}                                         I come from Canada
    36  2  {B, WRB, VBP, PP, VB, IN, QUEST, B}                             where do you come from?
    37  2  {B, MD, PP, VB, DT, NN, QUEST, B}                               would you like a cigarette?
    38  2  {B, RB, VIRG, NNS, VIRG, PP, VBP, RB, VB, B}                    no, thanks, I don't smoke
    39  2  {B, WP, VBP, PP, VBP, QUEST, B}                                 what do you do?
    40  2  {B, WP, VBZ, PP$, NN, QUEST, B}                                 what's your job?
    41  2  {B, WP, VBZ, DT, NN, VBZ, QUEST, B}                             what does this word means?
    42  2  {B, NN, VBZ, RB, VB, IN, JJ, NNS, B}                            rice doesn't grow in cold climates
    43  2  {B, PP, VBZ, RB, JJ, B}                                         He is so lazy
    44  2  {B, PP, VBZ, RB, VB, NN, TO, VB, PP, B}                         he doesn't do anything to help me
    45  2  {B, PP, VBZ, RB, IN, CD, NN, NN, DT, NN, B}                     she gets up at 8 o'clock every morning
    46  2  {B, WRB, RB, VB, PP, VB, TO, DT, NN, QUEST, B}                  how often do you go to the dentist?
    47  2  {B, NP, VBZ, RB, VB, NN, RB, RB, B}                             Jane doesn't drink tea very often
    48  2  {B, IN, NN, NP, RB, VBZ, NN, RB, CC, RB, DT, NN, B}             in summer John usually plays tennis once or tw
    49  2  {B, PP, VBP, PP, MD, RB, VB, JJ, B}                             I promise I won't be late
    50  2  {B, WP, VBP, PP, VBP, PP, VBP, QUEST, B}                        what do you suggest I do?

Note: the right part of the above table has been cut for lack of space (use NOTAG to remove the tags column if necessary).

NOTAG option

With the option NOTAG in GCseeunit, the tagged form is removed from the display.

Example: random sentences from the corpus.

    Clear[randomsentenses]
    randomsentenses[k_Integer /; k > 0, tag_: False] := GCsee[Table[1 + Random[Integer, GClength - 1], {k}], tag]
The function randomsentenses[k] picks k sentences at random within the corpus and shows them. One can see below that the sentences, even if artificial, are quite representative of utterances in actual human/human dialogues.

    randomsentenses[20, NOTAG]
    5533  106  I didn't want to wake anybody, so I came in as quietly as I could
    5337  103  you are quite right
    6518  123  our flat is on the second floor of the building
    624   16   Jim was on his hands and knees on the floor
    3078  62   it's a waste of money buying things you don't need
    1869  43   where were you born?
    5249  102  it costs too much money
    6928  129  I was very pleased with the present you gave me
    3679  73   if you want to get a degree, you normally have to study at university
    2788  57   Kate doesn't like to wear hats
    1887  43   Jill is liked by everybody
    5314  103  I quite agree with you
    213   8    no, I haven't been to India
    3854  78   my trousers are two long
    2333  51   yes, I really enjoyed it
    5298  103  I quite like tennis but it's not my favourite sport
    1193  29   do you know where she is?
    1347  32   why did she phone me in the middle of the night?
    218   8    Susan really loves that film
    3051  61   it was nice of you to come to see me
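The random draw used by randomsentenses relies on Random[Integer, n], which later Mathematica versions replace with RandomInteger. The sketch below is an assumption-laden variant, not part of the Grasp utilities: it only draws sentence numbers, so it runs without the corpus. The names n, withRepeats and noRepeats are hypothetical, and 7557 is the GClength value reported in section 4.

```mathematica
n = 7557;            (* GClength as reported in this tutorial *)
SeedRandom[2003];    (* only to make the draw repeatable *)

(* Same distribution as 1 + Random[Integer, n - 1]: a number may repeat *)
withRepeats = RandomInteger[{1, n}, 20];

(* Variant that guarantees 20 distinct sentence numbers *)
noRepeats = RandomSample[Range[n], 20];
```

With the corpus loaded, either list could be passed to GCsee, for example GCsee[noRepeats, NOTAG].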
4. Quantitative analysis of the Grasp Corpus

The number of sentences in the corpus is given by the global variable GClength:

    GClength
    7557

The number of characters in the corpus is given by the global variable GCcharlength:

    GCcharlength
    280464

The total number of words in the corpus, obtained by summing the position of the last tagged element of each sentence (so punctuation signs and the two border words of each sentence are counted), is:

    Plus @@ (First[Last[#]] & /@ (GCtag /@ Range[GClength]))
    77739

The distinct words occurring in the corpus are given as a list of strings by the global variable GCwords:

    Length[GCwords]
    3039

Here are some words from the corpus:

    Print[GCwords[[Range[1000, 1200]]]]
    {eyesight, face, fact, factory, failed, fails, fair, fall, fallen, falling, false, family, famous, fancy, far, fare, farm, farther, farthest, fast, faster, fastest, fat, father, fault, faults, favour, favourite, february, fed, feel, feeling, feet, fell, felling, felt, fence, Festival, few, field, fiend, fifth, Fifth, fifty, fight, fighting, figures, fill, filled, film, films, finally, financial, find, finding, finds, fine, finger, finish, finished, finishes, finishing, Fiona, fire, firemen, firm, first, fish, fishing, fit, fits, fitted, five, flag, flat, flats, flew, flies, flight, flights, floods, floor, Florence, flower, flowers, flown, flows, fluent, fluently, fly, flying, fog, followed, follows, fond, food, foot, football, footballers, for, foreign, foreigner, Forest, forget, forgetting, forgive, forgot, forgotten, form, fortunately, forward, found, four, france, France, Francisco, Frank, Fred, free, freezing, french, French, frequently, friday, Friday, Fridays, fridge, friend, friendly, friends, frightened, frightening, from, front, frying, full, fully, funniest, funny, furious, furniture, further, furthest, future, galleries, gallery, Gallery, gam, gambling, game, Games, garage, garden, gardener, gardening, Gary, gate, gave, generous, George, german, German, Germany, Gerry, get, gets, getting, ghost, ghosts, giraffe, girl, girlfriend, give, given, gives, giving, glad, glanced, Glasgow, glass, glasses, gloves, go, God, goes, going, gold, golf, gone, good, goodbye, goods, got, government, Graham, Grand, grandfather, grandmother, grass, grateful, great}

One can see that, since the words are listed as they appear in the sentences, multiple inflected forms of the same word occur. This is the reason why a lemmatization of the words is performed by TreeTagger. Consequently, there are fewer lemmas than words. This is confirmed by the length of the list of lemma strings given by the global variable GClemma.
    Length[GClemma]
    2259

Here are some lemmas from the corpus:

    Print[GClemma[[Range[2000, 2200]]]]
    {third, thirsty, thirty, this, Thomas, those, though, thousand, threaten, three, through, throw, Thursday, thus, ticket, tidy, tighten, till, Tim, time, timetable, Tina, tire, tired, to, today, together, Tolstoy, tom, Tom, tomato, tomorrow, tonight, too, top, totally, touch, tour, tourism, tourist, towards, towel, tower, Tower, town, toy, Trafalgar, traffic, train, translate, translator, transport, travel, tray, treasure, treat, treatment, tree, trip, trouble, trouser, true, trumpet, trust, truth, try, Tuesday, Tuesdays, tunnel, turn, TV, twenty, twice, two, type, typical, Ulysses, umbrella, unable, uncle, under, understand, unemployed, unexpected, unexpectedly, unfair, unfortunately, unfriendly, unhappy, Union, unite, United, universe, university, University, unknown, unless, unlock, unnecessarily, unnecessary, unpleasant, unsafe, unsure, untidy, until, unusual, unusually, unwell, up, upset, us, use, useful, useless, usual, usually, valuable, vegetable, vegetarian, Venice, Vera, verb, very, Vicky, Victoria, view, village, violence, violent, violin, visa, visit, visitor, voice, Volga, volleyball, vote, voyage, wait, waiter, waiting, waitress, wake, walk, walking, wall, Wall, wallet, want, war, warm, warn, wash, washing, Washington, washing up, waste, watch, water, wave, way, we, weak, wear, weather, wedding, Wednesday, week, weekend, weight, welcome, well, Well, well balanced, well behaved, well dressed, well informed, well kept, well known, well paid, well qualified, west, West, Western, Westminster, wet, wether, what, whatever, wheel, when, When, whenever, where, whether, which, while, white, White, who, whole}

The function GCwordcount (respectively GClemmacount) returns the number of occurrences of a word (respectively a lemma) in the corpus.

    GCwordcount["God"]
    3

    GCwordcount["Devil"]
    Devil is not a word of the corpus.

Here is a list of the twenty most frequent words:

    Take[Sort[GCwords, (GCwordcount[#] > GCwordcount[#2]) &], 20]
    { , the, I, to, you, not, is, a, ?, ,, it, in, have, was, of, do, we, at, for, she}

Here is a list of the top fifty lemmas, with their counts.

Note: the sortbycount function sorts a list of lemmas with the most frequent first.

    Clear[sortbycount]
    sortbycount[l_List] := Sort[l, (GClemmacount[#] > GClemmacount[#2]) &]

    Print[TableForm[Partition[# -> GClemmacount[#] & /@ Take[sortbycount[GClemma], {2, 50}], 8]]]
    be -> 3815     the -> 2791    I -> 2686      to -> 1939     you -> 1629    not -> 1561      have -> 1546
    do -> 1336     ? -> 1220      , -> 945       it -> 870      in -> 856      of -> 762        go -> 714
    at -> 466      for -> 452     she -> 448     on -> 439      get -> 430     he -> 400        me -> 384
    that -> 348    what -> 347    can -> 346     and -> 340     but -> 321     this -> 308      will -> 307
    they -> 282    as -> 282      time -> 279    my -> 271      like -> 264    so -> 257        her -> 251
    if -> 215      car -> 203     see -> 202     when -> 198    know -> 198    @card@ -> 197    want -> 190

This is a plot of the counts for all the lemmas of the corpus:
[Eighth column of the lemma count table, counts truncated in the source: a (14...), we (6...), very, there, work, think]
    ListPlot[GClemmacount /@ Drop[sortbycount[GClemma], 1], PlotRange -> All, PlotStyle -> RGBColor[1, 0, 0]];

[Plot: lemma count (vertical axis, ticks at 1000, 2000, 3000) against lemma rank (horizontal axis, ticks at 500, 1000, 1500, 2000)]
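The counts of this section can be reproduced on a tiny hand-made sample, which also shows what the word total at the start of the section actually measures. The sketch below is self-contained and only illustrative: the first tagged sentence is the GCtag[1] output of section 3.2, the tags of the second come from the GCsee display of sentence 6, and its lemmas ("be", "drive") are assumed. All variable names are hypothetical.

```mathematica
(* Two tagged sentences in the TAG[n, WORD, LEMMA] form.
   The lemmas of the second sentence are assumptions. *)
mini = {
   {B[1, "", ""], NP[2, "Ann", "Ann"], VBZ[3, "is", "be"], IN[4, "in", "in"],
    PP$[5, "her", "her"], NN[6, "car", "car"], B[7, "", ""]},
   {B[1, "", ""], PP[2, "I", "I"], VBP[3, "am", "be"],
    VBG[4, "driving", "drive"], B[5, "", ""]}
   };

(* The count used in this section: last position of every sentence,
   so the two border words of each sentence are included *)
allPositions = Plus @@ (First[Last[#]] & /@ mini)

(* The same count once the border words are dropped *)
inner = DeleteCases[Flatten[mini], _B];
wordsOnly = Length[inner]

(* Lemma frequencies as {lemma, count} pairs, most frequent first *)
lemmaCounts = SortBy[Tally[#[[3]] & /@ inner], -Last[#] &]
```

On this sample the first count is 12 and the second is 8: the difference is exactly two border words per sentence. Tally followed by SortBy also computes each frequency once, whereas sorting with a comparison function, as sortbycount does, calls GClemmacount at every comparison.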
5. Predefined lexical categories

The words of the corpus are divided by TreeTagger into 38 categories (the tags listed in GCcorpustags). We therefore decided to provide a more general categorization with only five categories, obtained by clustering the Penn Treebank categories.

5.1. Grammatical words

The GCgramwords variable contains all the lemmas (not the words!) viewed as grammatical, i.e. associated with the Penn Treebank tags registered in GCgramtags:

    GCgramtags
    B | SYM | TO | SENT | PP | PDT | PP$ | WDT | WP | WR$ | WRB | MD | LS | IN | FW | EX | DT | CD | CC | UH | QUEST | VIRG | POSS

Here are the grammatical words sorted by occurrence frequency:

    Print[sortbycount[GCgramwords]]
    Length[GCgramwords]
    { , the, I, to, you, a, ?, ,, it, in, of, we, at, for, she, on, he, me, that, what, can, and, but, this, will, there, they, as, my, like, so, her, if, when, @card@, would, by, with, out, could, your, no, about, !, all, an, some, any, his, where, why, need, him, who, up, how, from, us, them, than, which, because, should, please, our, two, next, one, every, quite, yes, after, shall, into, must, each, ago, or, before, yet, while, their, these, down, three, such, might, near, mine, ten, since, during, without, unless, neither, both, off, five, until, may, half, four, although, whether, those, worth, another, over, oh, whom, though, either, yourself, yours, themselves, past, myself, along, 8, six, himself, herself, through, once, OK, 5, outside, despite, against, nor, goodbye, between, around, :, theatre, hello, behind, 9, seven, ourselves, ought, eight, whenever, opposite, million, its, 3, 2, i, except, dare, across, 7, yourselves, whatever, under, towards, onto, nearest, hundred, check up, 1, ., Yes, wo, Why, When, Well, twenty, till, thousand, thirty, per, Oh, le, itself, fifty, 4}
    178

5.2. Adverbs

The GCadverbwords variable contains all the lemmas viewed as adverbs, i.e. associated with the Penn Treebank tags registered in GCadverbtags:

    GCadverbtags
    RB | RBR | RBS | RP

Here are the adverbs sorted by occurrence frequency: