Peter Grzybek
http://www-gewi.uni-graz.at/quanta
Austrian Research Fund
Project #15485
From the Economy of Language to the Self-Regulation of Cultural Systems
Corpus Linguistics vs. Text Analysis
Exact Literary Studies:
On the Prose of Karel Čapek
What Do the Words in a Verse Do with One Another?
On the Poetry of A.S. Puškin
Corpus Linguistics
vs.
Text Analysis
Analysis of Letter Frequencies
Methodological Problems in Former Studies
1. Insufficient Data Distinction
(graphemic and phonematic/phonetic data)
2. Insufficient Control of Data Homogeneity
(text / text segments / text mixtures (corpora))
3. Frequency Models: Continuous vs. Discrete
(a) theoretical entropy, repeat rate
(b) $\sum_i p_i = 1$
4. Goodness of Fit
Graphics vs. tests, R² vs. χ²
Analysis of Letter Frequencies
Methodological Decisions
1. Data Distinction
Graphemic data
2. Control of Data Homogeneity
Text vs. text segments vs. text cumulations vs. text mixtures (corpus)
3. Discrete Frequency Models
Test of relevant models
4. Goodness of Fit
χ² test: C = χ² / N (C < 0.02 = *; C < 0.01 = **)
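As a hedged illustration of this criterion, the following Python sketch computes χ² and the discrepancy coefficient C = χ² / N from observed and theoretical frequencies. The function names and the sample numbers are hypothetical and are not taken from the study.

```python
def discrepancy_coefficient(observed, expected):
    """Return (chi2, C), where C = chi2 / N and N is the total observed count."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    n = sum(observed)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, chi2 / n


def rating(c):
    """Map C to the star notation used on the slide."""
    if c < 0.01:
        return "**"
    if c < 0.02:
        return "*"
    return ""


# Hypothetical frequencies, for illustration only.
observed = [50, 30, 20]
expected = [48.0, 32.0, 20.0]
chi2, c = discrepancy_coefficient(observed, expected)
print(round(chi2, 4), round(c, 5), rating(c))
```

Dividing by N makes the criterion usable for very large samples, where the plain χ² test rejects almost any model.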
Analysis of Letter Frequencies
Slavic Alphabets
inventory size
minimal 25 Slovene
maximal 46 Slovak
medium 32/33 Russian(е / ё)
Analysis of Letter Frequencies
Russian
А Б В Г Д Е Ё Ж З И Й К Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

| No. | Author | Text | Chapter | Abbr. | N |
|---|---|---|---|---|---|
| 1 | A.S. Puškin | Evgenij Onegin | 1 | ASP-EO 1 | 15830 |
| 2 | | | 2 | ASP-EO 2 | 11544 |
| 3 | | | 3 | ASP-EO 3 | 13597 |
| 4 | | | 4 | ASP-EO 4 | 12475 |
| 5 | | | 5 | ASP-EO 5 | 12018 |
| 6 | | | 6 | ASP-EO 6 | 12742 |
| 7 | | | 7 | ASP-EO 7 | 15180 |
| 8 | | | 8 | ASP-EO 8 | 15864 |
| 9 | | | 1-2 | ASP-EO 1-2 | 27374 |
| 10 | | | 1-3 | ASP-EO 1-3 | 40971 |
| 11 | | | 1-4 | ASP-EO 1-4 | 53446 |
| 12 | | | 1-5 | ASP-EO 1-5 | 65464 |
| 13 | | | 1-6 | ASP-EO 1-6 | 78206 |
| 14 | | | 1-7 | ASP-EO 1-7 | 93386 |
| 15 | | | complete text | ASP-EO 1-8 | 109250 |
| 16 | L.N. Tolstoj | Anna Karenina | complete text | LNT-AK | 1336483 |
| 17 | | Otročestvo | complete text | LNT-O | 113954 |
| 18 | F.M. Dostojevskij | Prestuplenie i nakazanie | complete text | FMD-PN | 837885 |
| 19 | | Zapiski iz podpol'ja | complete text | FMD-ZAP | 188249 |
| 20 | A.P. Čechov | Čajka | complete text* | APČ-Č | 145735 |
| 21 | | Djadja Vanja | complete text* | APČ-DV | 60871 |
| 22 | M. Gor'kij | Mat' | complete text* | MG-MA | 433177 |
| 23 | | Na dne | complete text | MG-ND | 76039 |
| 24 | www.rusmet.ru | Ural'skij rynok metallov | techn. text | UR | 8061 |
| 25 | www.phyton.ru | Instr. sredstva […] | techn. text | IN | 18711 |
| 26 | A.S. Puškin | Evgenij Onegin | ch. 1 & 8 | ASP-EO1+8 | 31694 |
| 27 | L.N. Tolstoj | Anna Karenina | pt. 8 (ch. 18) & pt. 1 (ch. 1) | LNT-AK8+1 | |
| 28 | F.M. Dostojevskij | Prestuplenie i nakazanie | pt. 1 (ch. 1) & pt. 6 (ch. 8) | FMD-PN1+6 | 29498 |
| 29 | A.S. Puškin & L.N. Tolstoj | Evgenij Onegin & Anna Karenina | complete texts | ASP+LNT | 1445733 |
| 30 | A.S. Puškin & F.M. Dostojevskij | Evgenij Onegin & Prestuplenie i nakazanie | complete texts | ASP+FMD | 947135 |
| 31 | A.S. Puškin & text 24 | Evgenij Onegin & text 24 | complete texts | ASP+UR | 117311 |
| 32 | L.N. Tolstoj & text 24 | Anna Karenina & text 24 | complete texts | LNT+UR | 1344544 |
| 33 | F.M. Dostojevskij & text 25 | Prestuplenie i nakazanie & text 25 | complete texts | FMD+IN | 856596 |
| 34 | M. Gor'kij & text 25 | Na dne & text 25 | complete texts | MG+IN | 95312 |
| 35 | A.S. Puškin | Evgenij Onegin | ch. 5, verse 1-5 per ch. | ASP1-5 | |
| 36 | F.M. Dostojevskij | Prestuplenie i nakazanie | epilogue, each alternate line | FMD-2 | 14464 |
| 37 | L.N. Tolstoj | Anna Karenina | pt. 4 (ch. 1-5), every 4th line | LNT-4 | |
| 38 | Complete corpus | | | CC | 3328454 |

The values 7720, 4323, and 7141 also appear on the slide; they belong to samples 27, 35, and 37, but which value goes with which sample cannot be recovered from the extracted text.
Zipf (Zeta) distribution
Basic assumption:
$r \cdot f_r = c \quad\Rightarrow\quad f_r = c / r$
$P_r = \dfrac{c}{r^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad c^{-1} = \sum_{j=1}^{\infty} j^{-a}$
[Figure: observed frequencies f(i) and Zeta fit NP(i) by rank]
Zipf-Mandelbrot distribution
Basic assumption:
$f_r = c / (r + b)^{a}$
$P_r = \dfrac{c}{(b + r)^{a}}, \quad r = 1, 2, 3, \ldots, \quad a > 1, \quad b > -1, \quad c^{-1} = \sum_{j=1}^{\infty} (b + j)^{-a}$
[Figure: observed frequencies f(i) and Zipf-Mandelbrot fit NP(i) for ranks 1 to 32]
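A minimal sketch of how rank probabilities of this type could be computed. It assumes a right-truncated variant that is normalised over a finite inventory of n ranks (an assumption made here because an alphabet is finite); the function name and all parameter values are illustrative only. Setting b = 0 gives the corresponding Zeta-type probabilities.

```python
def zipf_mandelbrot_probs(a, b, n):
    """Right-truncated Zipf-Mandelbrot probabilities for ranks 1..n.

    P_r is proportional to (r + b) ** (-a); the weights are normalised
    over the n ranks so that they sum to 1.
    """
    weights = [(r + b) ** -a for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters and sample size, for illustration only.
probs = zipf_mandelbrot_probs(a=1.2, b=2.5, n=33)
sample_size = 10000
theoretical_counts = [sample_size * p for p in probs]  # the NP(i) values
print([round(v) for v in theoretical_counts[:5]])
```

The theoretical counts NP(i) obtained this way are what a goodness-of-fit measure compares with the observed rank frequencies f(i).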
Zipf and Zipf-Mandelbrot Distributions: Goodness of Fit
(38 Russian samples)
[Figure: goodness of fit (0.00 to 0.20) for samples 1 to 38; series: rt. Zeta, Zipf-Mandelbrot]
Geometric Distribution and Good Distribution
$P_r = p\,q^{r-1}, \quad r = 1, 2, \ldots$
$P_r = c\,\dfrac{a^{r}}{r^{b}}, \quad r = 1, \ldots, n, \quad c^{-1} = \sum_{j=1}^{n} \dfrac{a^{j}}{j^{b}}$
[Figure: goodness of fit (0.00 to 0.20) for samples 1 to 38; series: rt. geometric, Good]
Negative Hypergeometric Distribution
$P_x = \dfrac{\dbinom{M + x - 2}{x - 1}\,\dbinom{K - M + n - x}{n - x + 1}}{\dbinom{K + n - 1}{n}}$
n = inventory size, x = class
2 parameters: K and M
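The following Python sketch shows one way to evaluate negative hypergeometric probabilities with real-valued parameters. It assumes the 1-displaced form with support x = 1, ..., n + 1 and K > M > 0; the slide does not state the support, so that range is an assumption, and the function names and parameter values are hypothetical.

```python
from math import exp, lgamma


def log_binom(top, k):
    """Log of the generalised binomial coefficient C(top, k).

    top may be any real number for which the gamma arguments stay positive;
    k is a non-negative integer.
    """
    return lgamma(top + 1) - lgamma(k + 1) - lgamma(top - k + 1)


def neg_hypergeom_probs(K, M, n):
    """1-displaced negative hypergeometric probabilities for x = 1..n+1."""
    if not (K > M > 0):
        raise ValueError("requires K > M > 0")
    log_denominator = log_binom(K + n - 1, n)
    probs = []
    for x in range(1, n + 2):
        log_p = (log_binom(M + x - 2, x - 1)
                 + log_binom(K - M + n - x, n - x + 1)
                 - log_denominator)
        probs.append(exp(log_p))
    return probs


# Hypothetical parameter values, for illustration only.
probs = neg_hypergeom_probs(K=3.0, M=0.8, n=32)
print(round(sum(probs), 6), [round(p, 4) for p in probs[:3]])
```

Working with log-gamma values avoids overflow in the binomial coefficients and allows non-integer K and M.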
[Figure: observed frequencies f(i) and negative hypergeometric fit NP(i) by rank]
Analysis of Russian Letter Frequencies:
Corpus: 37 Texts (ca. 8.5 million letters)
Analysis of Russian Letter Frequencies
Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus
Negative Hypergeometric Distribution
[Figure: Constancy of goodness of fit (C), values between 0.00 and 0.10]
[Figure: Constancy of parameters (K, M), values between 0.00 and 4.00; series: Parameter K, Parameter M]
Analysis of Slovene Letter Frequencies
Corpus: ca. 130,000 letters
Goodness of fit
(C = 0.0094)
Negative Hypergeometric Distribution
[Figure: observed frequencies and negative hypergeometric fit for ranks 1 to 29]
Analysis of Slovene Letter Frequencies
Comparison of Texts, Text Segments, Text Cumulations, Text Mixtures, and Complete Corpus
Constancy of goodness of fit (C) Constancy of Parameters (K, M)
Negative Hypergeometric Distribution
[Figure: Constancy of parameters (K, M) for samples 1 to 20, values between 0.00 and 3.50]
[Figure: Constancy of goodness of fit (C) for samples 1 to 20, values between 0.00 and 0.20; series: NHG]
Analysis of Slovene Letter and Phoneme Frequencies:
Corpus: ca. 130,000
[Figure: Slovene Letters, observed frequencies and negative hypergeometric fit for ranks 1 to 29]
[Figure: Slovene Phonemes, observed frequencies and negative hypergeometric fit for ranks 1 to 29]
First Tentative Results of Slovak Letter Frequencies
[Figure: observed frequencies and negative hypergeometric fit by rank]
Tasks:
1. Interpretation of parameters: the “foreign letters” Q, W, X influence inventory size
2. Exploration of the data basis: texts, text segments, text cumulations, text mixtures
The Question of Data Homogeneity
“[…] the magnitude of words tends, on the whole, to stand in an inverse (not necessarily proportionate) relationship
to the number of occurrences”
Zipf (1935: 25)
Four major problems in research
What is the direction of dependence:
Does frequency depend on length or vice versa?
What is the unit of measurement:
Is word length measured in letters, phonemes, syllables, morphemes, ...?
What is frequency:
Absolute occurrence or the rank of words, or of word forms?
What is the text basis:
Corpus data, frequency dictionaries, ..., individual texts?
Assuming that
word length is a variable of frequency
Measuring
word length in the number of syllables per word
Analyzing
the absolute occurrence of words
the influence of the text basis shall be tested:
Individual texts vs. text cumulations vs. corpus data
DATA HOMOGENEITY
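The setup described on this slide can be sketched in Python as follows: word types are grouped by their absolute frequency, and the mean length in syllables is computed for each frequency class. Counting vowel letters is only a crude stand-in for a real syllable count and is an assumption of this sketch, as are the function names and the toy input.

```python
from collections import Counter, defaultdict

# Assumption: one syllable per vowel letter (Latin and Cyrillic vowels).
VOWELS = set("aeiouy") | set("аеёиоуыэюя")


def count_syllables(word):
    """Approximate the syllable count by the number of vowel letters (min. 1)."""
    return max(1, sum(ch in VOWELS for ch in word.lower()))


def mean_length_by_frequency(tokens):
    """Map each absolute frequency to the mean syllable length of its word types."""
    frequencies = Counter(tokens)
    lengths = defaultdict(list)
    for word, freq in frequencies.items():
        lengths[freq].append(count_syllables(word))
    return {freq: sum(vals) / len(vals) for freq, vals in sorted(lengths.items())}


# Toy input, for illustration only.
tokens = "the cat and the dog and the banana".split()
print(mean_length_by_frequency(tokens))
```

Running the same function on an individual text, on a cumulation of chapters, and on a mixed corpus gives the three kinds of data sets whose results are compared.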
Intertextual Inhomogeneity vs. Intratextual Inhomogeneity
Intertextual: combination (“mixture”) of different texts
• different languages
• different authors
• different text types
Intratextual: a ‘text’ in itself does not consist of homogeneous elements
• complete novel, composed of chapters
• complete book of a novel, consisting of several chapters
• individual chapters
• dialogical vs. narrative sequences within a text
$\dfrac{dy}{y - 1} = -b\,\dfrac{dx}{x}$
$y = a\,x^{-b} + 1$
$x = A\,(y - 1)^{-B}$, with $A = a^{1/b}$, $B = 1/b$
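As a hedged sketch, the curve y = a·x^(-b) + 1 can be fitted by least squares using only the standard library. For a fixed b the best a has a closed form, so only b needs a one-dimensional search; golden-section search is used here and assumes a single minimum in the search interval. This is an illustrative procedure with hypothetical names, not necessarily the fitting method used for the results reported in this deck.

```python
def fit_power_plus_one(xs, ys, b_lo=0.0, b_hi=5.0, iterations=100):
    """Least-squares fit of y = a * x**(-b) + 1; returns (a, b)."""

    def best_a(b):
        # For fixed b the model is linear in a, so a has a closed form.
        num = sum((y - 1.0) * x ** -b for x, y in zip(xs, ys))
        den = sum(x ** (-2.0 * b) for x in xs)
        return num / den

    def sse(b):
        a = best_a(b)
        return sum((y - (a * x ** -b + 1.0)) ** 2 for x, y in zip(xs, ys))

    # Golden-section search for the b that minimises the squared error.
    phi = (5 ** 0.5 - 1) / 2
    lo, hi = b_lo, b_hi
    for _ in range(iterations):
        m1 = hi - phi * (hi - lo)
        m2 = lo + phi * (hi - lo)
        if sse(m1) < sse(m2):
            hi = m2
        else:
            lo = m1
    b = (lo + hi) / 2
    return best_a(b), b


# Synthetic data generated from known parameters, for illustration only.
xs = list(range(1, 21))
ys = [2.0 * x ** -0.5 + 1.0 for x in xs]
a_hat, b_hat = fit_power_plus_one(xs, ys)
print(round(a_hat, 3), round(b_hat, 3))
```

Fitting the curve directly, rather than regressing ln(y − 1) on ln(x), keeps data points with y = 1 usable, since their logarithm would be undefined.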
Russian: Anna Karenina (ch. 1)

| x (frequency) | y (length), observed | y (length), theoretical |
|---|---|---|
| 1 | 2.92 | 3.03 |
| 2 | 2.14 | 2.04 |
| 3 | 2.05 | 1.70 |
| 4 | 1.50 | 1.53 |
| 5 | 1.33 | 1.43 |
| 6 | 1.50 | 1.36 |
| 7 | 1.67 | 1.31 |
| 8 | 1.00 | 1.27 |
| 9 | 1.00 | 1.24 |
| 10 | 1.00 | 1.22 |
| 13 | 1.00 | 1.17 |
| 19 | 1.00 | 1.12 |
| 20 | 1.00 | 1.11 |
| 37 | 1.00 | 1.06 |

a = 2.0261, b = 0.9660, R² = 0.88, N = 397
[Figure: mean word length against frequency; series: observed, theoretical]
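As a worked check, the short Python sketch below evaluates y = a·x^(-b) + 1 with the parameters given on this slide and computes a coefficient of determination from the observed values listed there. The definition R² = 1 − SS_res / SS_tot is an assumption, since the slide does not say how R² was computed; the result can be compared with the reported R² = 0.88.

```python
# Frequency classes and observed mean lengths as listed on the slide.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 19, 20, 37]
observed = [2.92, 2.14, 2.05, 1.50, 1.33, 1.50, 1.67,
            1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
a, b = 2.0261, 0.9660  # parameters reported on the slide

theoretical = [a * x ** -b + 1 for x in xs]

# Assumed definition: R^2 = 1 - SS_res / SS_tot.
mean_obs = sum(observed) / len(observed)
ss_res = sum((y - t) ** 2 for y, t in zip(observed, theoretical))
ss_tot = sum((y - mean_obs) ** 2 for y in observed)
r_squared = 1 - ss_res / ss_tot

for x, y, t in zip(xs, observed, theoretical):
    print(f"{x:>3}  {y:.2f}  {t:.2f}")
print(f"R^2 = {r_squared:.2f}")
```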
| Text | Language | N | R² | a | b |
|---|---|---|---|---|---|
| Anna Karenina (I,1) | Russian | 397 | 0.88 | 2.03 | 0.97 |
| Evgenij Onegin (I) | Russian | 1871 | 0.96 | 1.70 | 0.79 |
| Na badnjak | Croatian | 2450 | 0.93 | 1.95 | 0.51 |
| Zářivé hlubiny | Czech | 1363 | 0.94 | 1.76 | 0.59 |
| Hiša M.P. (I) | Slovenian | 1147 | 0.84 | 1.80 | 0.40 |
| Zakliata panna | Slovak | 926 | 0.88 | 1.48 | 0.69 |
| Hänsel und Gretel | German | 803 | 0.87 | 1.16 | 0.51 |
| Fairy Tale by Móra | Hungarian | 234 | 0.96 | 1.57 | 0.84 |
| Di lembung kuring | Sundanese | 431 | 0.91 | 1.86 | 0.51 |
| Burung api | Indonesian | 1393 | 0.92 | 2.44 | 0.26 |
| Portrait of a Lady (I) | English | 1104 | 0.89 | 1.23 | 0.83 |

0.84 ≤ R² ≤ 0.96
The course of the theoretical curves
[Figure: theoretical curves for texts 1 to 11; x-axis 1 to 10, y-axis 1 to 4]
The relationship between parameters a and b
[Figure: parameter b (0 to 1.2) against parameter a (1.0 to 2.5)]
The relationship between text length (N) and parameter a
[Figure: parameter a (0 to 3) against text length (N), axis ticks 0 to 250]
Obvious data inhomogeneity
1. Texts from different languages, authors, and various text types
2. Violation of the ceteris paribus condition
Ergo: The data in this mixture are not adequate
for testing the hypothesis at stake
Lev N. Tolstoj: Anna Karenina Chap. I,1 vs. I (34 chapters)
[Figure: mean word length (0.00 to 4.00) against word frequency; series: A.K. (I,1) emp., A.K. (I,1) th., A.K. (I) emp., A.K. (I) th.]

| | N (types) | C | a | b |
|---|---|---|---|---|
| AK (I,1) | 397 | 0.97 | 2.03 | 0.97 |
| AK (I) | 8661 | 0.86 | 2.60 | 0.27 |
Henry James: Portrait of a Lady Chap. 1 vs. novel (52 chapters)
| | N (types) | C | a | b |
|---|---|---|---|---|
| I | 1104 | 0.89 | 1.23 | 0.83 |
| I-52 | 10727 | 0.58 | 1.84 | 0.27 |

[Figure, two panels: observed and theoretical mean word length against frequency]
Ks.Š. Gjalski: Na badnjak. Narrative vs. dialogical sequences

| | N (types) | C | a | b |
|---|---|---|---|---|
| narrative | 1913 | 0.91 | 1.93 | 0.54 |
| dialogues | 673 | 0.96 | 1.61 | 0.84 |

[Figure: curves for frequency classes 1 to 91, values between 0.00 and 3.50; series: Narration, Dialogue]
Evgenij Onegin: Text cumulation (I – VIII)

| Chapter | N (types) | M (tokens) | a | b | R² |
|---|---|---|---|---|---|
| I | 1871 | 3209 | 1.70 | 0.79 | 0.96 |
| I+II | 2918 | 5546 | 1.84 | 0.69 | 0.88 |
| I-III | 3951 | 8359 | 1.92 | 0.57 | 0.88 |
| I-IV | 4851 | 10936 | 1.97 | 0.53 | 0.92 |
| I-V | 5737 | 13376 | 1.95 | 0.48 | 0.94 |
| I-VI | 6509 | 15978 | 1.97 | 0.52 | 0.94 |
| I-VII | 7476 | 19061 | 2.03 | 0.43 | 0.86 |
| text (I-VIII) | 8329 | 22482 | 2.05 | 0.40 | 0.88 |

Results of fitting y = ax^-b + 1 to the cumulative text of Evgenij Onegin
[Figure: theoretical curves for the cumulations 1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7 and the complete text; x-axis 1 to 10, y-axis 1.00 to 3.50]
Evgenij Onegin – text cumulation (chap. I – VIII)
Fitting y = ax^-b: R² = 0.92
[Figure: parameter b (0.00 to 1.00) against parameter a (1.6 to 2.1)]
Dependence of parameter b on parameter a
Evgenij Onegin Text cumulation (I – VIII)
Dependence of a on Text Length (N):
$a = 0.6493\,N^{0.1286}$ (R² = 0.96)
[Figure: parameter a (1.50 to 2.10) against text length (word-form types)]
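To make the reported dependence concrete, the following Python sketch evaluates a = 0.6493 · N^0.1286 for the eight cumulation steps of Evgenij Onegin and sets the result beside the fitted values of a. The type counts N and the fitted a values are those given for the text cumulation in this deck; the function name is hypothetical.

```python
# Word-form types (N) and fitted parameter a for chapters I, I+II, ..., I-VIII.
types = [1871, 2918, 3951, 4851, 5737, 6509, 7476, 8329]
a_fitted = [1.70, 1.84, 1.92, 1.97, 1.95, 1.97, 2.03, 2.05]


def predicted_a(n_types, coef=0.6493, exponent=0.1286):
    """Evaluate the reported power function a = coef * N ** exponent."""
    return coef * n_types ** exponent


for n, a_obs in zip(types, a_fitted):
    a_pred = predicted_a(n)
    print(f"N = {n:>5}  fitted a = {a_obs:.2f}  predicted a = {a_pred:.2f}")
```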
Summary & Results (I)
Data corroborate hypothesis: $L = f(F)$: $y = a\,x^{-b} + 1$
There is a specific interrelation of parameters:
a = f (N) b = g(a)
b = h(N)
f, g, h functions of the same type
Summary & Results (II)
1. Homogeneous texts do not interfere with linguistic laws; inhomogeneous texts can distort the textual reality.
2. Text mixtures can evoke phenomena which do not exist as such in individual texts.
3. Short texts do not allow a property to take appropriate shape; long texts (and corpora) contain mixed generating regimes superposing different layers, which may lead to “artificial” phenomena.
4. With an increase of text size, the resulting curve of the frequency-length relationship is shifted upwards; this is caused by the fact that the number of words occurring only once increases up to a certain text length. If this assumption is correct, then b converges to zero, yielding the limit y = a.
F I N I S