Jul-21-2020, 05:16 PM
I want to calculate the average lexical diversity over the course of a text. The window length is 1000 words, and consecutive windows overlap by 500 words, e.g. [0:999], [500:1499], [1000:1999], etc. First, below, I define a function that calculates the total number of slices for any given text; that result will be used to calculate the average lexical diversity.
def slice_total(text):
    return len(text) / 500

The objective now is to assess the lexical diversity of each of the increments specified above (i.e. windows of 1000 words). Constraining a text to one, two, or three individual windows is easy with [#:#]. The puzzle I have yet to piece together is how to constrain the assessment to successive 1000-word windows without writing them all out. Text 1, for instance, is over 200,000 tokens long; writing out every increment, e.g. [0:999], [500:1499], [1000:1999], [1500:2499], etc., is an onerous, inefficient task. Below is the code I have so far:

>>> def slice_total(text):
        return len(text) / 500
>>> print(slice_total(text1))
[output]521.638[/output]
>>> print(slice_total(text2))
[output]283.152[/output]
>>> print(slice_total(text3))
[output]89.528[/output]
>>> print(slice_total(text4))
[output]299.594[/output]
>>> print(slice_total(text5))
[output]90.02[/output]
>>> print(slice_total(text6))
[output]33.934[/output]
>>> print(slice_total(text7))
[output]201.352[/output]
>>> print(slice_total(text8))
[output]9.734[/output]
>>> print(slice_total(text9))
[output]138.426[/output]

I attempt to reach the correct result with the following code. It includes two functions (lexical_diversity and lexical_diversity_average) applied with the 1000-word window constraints. The last two operations give the same result, which may confirm the 1000-word window with a 500-word overlap between windows. If that is the case, does the operation with '+1000' account for the entire text? My intuition is no, insofar as we expect the result to decrease as more text is included. Does the '+1000' constraint affect the result at all?

>>> def lexical_diversity(text):
        return len(set(text))/len(text)
>>> lexical_diversity(text1[0:999])
0.46146146146146144
>>> lexical_diversity(text1[0:999+1000])
0.40370185092546274
>>> def lexical_diversity_average(text):
        return len(text)/len(set(text))
>>> lexical_diversity_average(text1[0:999])
[output]2.1670281995661607[/output]
>>> lexical_diversity_average(text1[0:999][500:1499])
[output]1.9192307692307693[/output]
>>> lexical_diversity_average(text1[0:999][500:1499][1000:1999])
[error]Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    lexical_diversity_average(text1[0:999][500:1499][1000:1999])
  File "<pyshell#23>", line 2, in lexical_diversity_average
    return len(text)/len(set(text))
ZeroDivisionError: division by zero[/error]
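
A likely explanation for the error above is that chained slices are applied one after another, each to the result of the previous one, so the indices are relative to an ever shorter list rather than to the original text. The sketch below illustrates this with a made-up stand-in for the text (`tokens`, a list of 3000 integers, is an assumption for illustration and not one of the texts used here).

```python
# Hypothetical stand-in for a tokenised text: 3000 integer "tokens".
tokens = list(range(3000))

first = tokens[0:999]        # 999 items (indices 0..998; the stop index is excluded)
second = first[500:1499]     # indices are relative to `first`, so only 499 items remain
third = second[1000:1999]    # `second` has fewer than 1000 items, so this slice is empty

print(len(first), len(second), len(third))

# Slicing the original sequence directly gives a full window.
window = tokens[500:1500]    # 1000 items: indices 500..1499
print(len(window), window[0], window[-1])
```

An empty slice has a length of zero and an empty set, which is what makes the division fail. Note also that a stop index of 999 yields 999 items, so a full 1000-word window would be [0:1000], not [0:999].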
>>> lexical_diversity_average(text1[0:999+1000])
[output]2.477075588599752[/output]
>>> lexical_diversity_average(text1[0:999][500:1499])
[output]1.9192307692307693[/output]
>>> lexical_diversity_average(text1[0:999][500:1499+1000])
[output]1.9192307692307693[/output]
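
One possible way to avoid writing out every increment is to generate the window start positions with `range` and slice the original text once per window. This is only a minimal sketch: the names `sliding_windows` and `mean_windowed_diversity` are hypothetical, it assumes the text supports `len()` and slicing like a list of tokens, and it drops any trailing partial window shorter than `size`.

```python
def sliding_windows(tokens, size=1000, step=500):
    """Yield successive windows of `size` tokens, advancing `step` tokens each time."""
    # Stop once a full window no longer fits; a text shorter than `size` yields nothing.
    for start in range(0, len(tokens) - size + 1, step):
        yield tokens[start:start + size]


def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens."""
    return len(set(tokens)) / len(tokens)


def mean_windowed_diversity(tokens, size=1000, step=500):
    """Average the lexical diversity over all full overlapping windows."""
    scores = [lexical_diversity(w) for w in sliding_windows(tokens, size, step)]
    if not scores:
        raise ValueError("text is shorter than one window")
    return sum(scores) / len(scores)


# Toy example (made-up data): 12 tokens, windows of 4 tokens overlapping by 2.
toy = ["a", "b", "c", "d"] * 3
toy_windows = list(sliding_windows(toy, size=4, step=2))
print(len(toy_windows))
print(mean_windowed_diversity(toy, size=4, step=2))
```

With the defaults, a call such as `mean_windowed_diversity(text1)` would cover the windows [0:1000], [500:1500], [1000:2000], and so on. Whether the final partial window should be included is a design choice this sketch does not settle.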
