Statistical Significance of Word Frequency

I frequently read commentators asserting that a certain word is used more or less frequently in a given book than the rest of the Greek NT, but these generally seem to be 'gut' observations rather than statistically sound conclusions.
Has anyone attempted statistical analysis of word frequencies? I'm thinking a chi-squared test or maybe Yates' corrected chi-squared statistic would give statistical rigor to such claims. Is such an analysis possible within Logos or would I need to export frequency lists from Logos into Excel for further analysis?
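For readers who want to see what such a test involves, here is a minimal sketch in Python (not something Logos or Excel provides out of the box). It computes the Yates-corrected chi-squared statistic for a 2x2 table: the book versus the rest of the corpus, and the lemma versus all other words. The function name and the counts in the example call are hypothetical, chosen only for illustration.

```python
import math

def yates_chi_squared(count_in_book, book_total, count_in_rest, rest_total):
    """Yates-corrected chi-squared statistic and p-value for a 2x2 table.

    Rows: the book vs. the rest of the corpus.
    Columns: occurrences of the lemma vs. all other words.
    """
    a = count_in_book
    b = book_total - count_in_book
    c = count_in_rest
    d = rest_total - count_in_rest
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, but never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * diff ** 2 / denom
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a lemma seen 40 times in a 15,000-word book
# and 120 times in the remaining 120,000 words of the corpus.
chi2, p = yates_chi_squared(40, 15_000, 120, 120_000)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```

A p-value below the chosen cut-off (5% is common) would suggest the lemma's rate in the book differs from its rate in the rest of the corpus. The same arithmetic can be reproduced in a spreadsheet from exported frequency lists.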
Comments
-
The new Concordance feature almost gives us a way to calculate the appropriate statistics - it still lacks an option for manual correction of the stemming algorithm and n-gram analysis. However, you can export the Concordance results to Excel.
Orthodox Bishop Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."; Orthodox proverb: "We know where the Church is, we do not know where it is not."
0 -
MJ. Smith said:
The new Concordance feature almost gives us a way to calculate the appropriate statistics - it still lacks an option for manual correction of the stemming algorithm and n-gram analysis. However, you can export the Concordance results to Excel.
I'm not currently a subscriber to Logos Now, but my understanding is that the Concordance feature provides word counts (per word and total) which could then be used to manually compare to another text (e.g. the entire NT). Is that correct? There is no statistical analysis built in which would highlight words whose frequency is significantly different from the rest of the NT?
0 -
That is correct
0 -
Andrew Dreger said:
I frequently read commentators asserting that a certain word is used more or less frequently in a given book than the rest of the Greek NT, but these generally seem to be 'gut' observations rather than statistically sound conclusions.
Has anyone attempted statistical analysis of word frequencies? I'm thinking a chi-squared test or maybe Yates' corrected chi-squared statistic would give statistical rigor to such claims. Is such an analysis possible within Logos or would I need to export frequency lists from Logos into Excel for further analysis?
'Statistical significance' is only meaningful when it is relative to a particular hypothesis. When you are talking about statistical analyses of word frequencies, these have been done for ages as a part of textual criticism, particularly with regard to trying to identify particular authors or verify stated authorship. More importantly than simple frequencies are the particular ways that words, lemmas, or roots are used by an author, as those can reveal times, places, and cultural conventions.
A chi-square is a non-parametric test for independence. I'm not sure what you would be attempting to demonstrate by a finding of statistical independence of word frequencies... you'd have to be much more specific about what you are testing. Simply testing for differences in word frequencies in scripture is like testing for differences in color frequencies in a paint set. Significant findings are simply a matter of testing until you get a hit, as significant differences in frequencies are a function of the machine. Stated more formally, if you have a population of size N, the probability of a significant finding approaches 1.0 as the number of tests approaches N.
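The "testing until you get a hit" point can be illustrated with the standard family-wise error rate. This is a rough sketch under one loud assumption: the tests are independent and each is run at the same significance level alpha. It is not the exact formulation in the post above. Both function names are hypothetical.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests.

    Assumes every test is independent and uses the same level alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests

# How quickly a "hit" becomes near-certain as more lemmas are tested.
for m in (1, 10, 100, 1000):
    rate = family_wise_error_rate(0.05, m)
    print(f"{m:5d} tests: P(at least one false hit) = {rate:.4f}, "
          f"Bonferroni per-test cut-off = {bonferroni_threshold(0.05, m):.6f}")
```

Testing every lemma in a book at 5% therefore almost guarantees some spurious hits. A correction such as Bonferroni is one common way to compensate, at the cost of missing weaker real effects.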
Logos is not a statistical software package, and I hope they never try to make it one. But then, neither is Excel, really. You can do some fairly basic descriptive statistics in Excel, but you'll need to take your data to a more robust bit of software to do much else (such as SPSS, SAS, JMP, etc.). OTOH, if you want some very nice graphics with the frequencies, Excel is the place to go.
Eating a steady diet of government cheese, and living in a van down by the river.
0 -
Andrew Dreger said:
I frequently read commentators asserting that a certain word is used more or less frequently in a given book than the rest of the Greek NT, but these generally seem to be 'gut' observations rather than statistically sound conclusions.
Doc has suggested that what you may be looking for cannot be established simply by looking at statistics. I doubt most of the comments you've read would be amenable to statistical analysis because of the low frequency of those words in the Bible itself. You do have to have a sufficiently large sample size (Doc can correct me) for significance to result from statistical analysis.
You do raise an interesting question, but one I don't think straight statistics are the answer to.
When I get interested in a word and want to see its relative frequency, I do a search on the lemma in the entire testament in which the book is found. After performing the search I have two tools to help answer my question about the relative frequency of use of a word. The first is the Graph results option on the Search results panel. I usually use a bar graph for this.
For example, I am studying 1 Corinthians and came across the word 'wisdom' which translates the Greek σοφία. It occurs 17 times in 1 Corinthians while occurring only 34 times elsewhere in the NT (so 33% of the occurrences are in 1 Cor.). That seems significant. Further investigation of the search results shows that of the 17 times the lemma appears in 1 Corinthians, 16 are in the first three chapters. Further I note that only Luke besides Paul uses the word wisdom more than once in the same chapter. He uses it twice in Luke 2, Luke 11, Acts 6, and Acts 7.
If I were to do a search on the root σοφος I would find additional information. Here I discover 77 results in the NT and 28 of them in 1 Corinthians. Thus when the root is used, Paul's use of it in 1 Corinthians occurs slightly more frequently than just the word σοφία (36% versus 33%).
The second tool I can use is the Analysis view in Search results. This is helpful when searching the root word rather than the lemma. What I observe when I group the words by lemma is the high concentration of the uses of lemmas other than σοφία in 1 Corinthians 1-3 (10 of the 26 other root word uses occur in these 3 chapters). I note these other lemmas are rather scattered throughout the rest of the NT. They are only concentrated in 1 Cor 1-3.
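The arithmetic behind these percentages can be checked in a few lines. The counts are the ones quoted in this post for σοφία and its root. The variable names are only illustrative, and the figure for other lemmas in 1 Corinthians is derived by subtraction rather than stated in the post.

```python
# Counts quoted in the post (lemma sophia and its root).
sophia_1cor = 17        # lemma occurrences in 1 Corinthians
sophia_elsewhere = 34   # lemma occurrences in the rest of the NT
root_nt = 77            # all words from the root in the NT
root_1cor = 28          # of those, in 1 Corinthians

sophia_nt = sophia_1cor + sophia_elsewhere      # lemma total in the NT
other_lemmas_nt = root_nt - sophia_nt           # root words that are not the lemma
other_lemmas_1cor = root_1cor - sophia_1cor     # derived, not stated in the post

print(f"lemma share in 1 Cor: {sophia_1cor / sophia_nt:.0%}")
print(f"root share in 1 Cor:  {root_1cor / root_nt:.0%}")
print(f"other lemmas from the root: {other_lemmas_nt} in the NT, "
      f"{other_lemmas_1cor} in 1 Cor")
```

These shares describe where the occurrences fall. They do not account for how long 1 Corinthians is relative to the NT, which is what a rate-based test would add.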
I can't avoid concluding that in the first three chapters Paul has an emphasis on wisdom greater than any other passage in the New Testament. Of course that doesn't answer every question of how significant Paul's use is since he might simply be quoting a passage or two or making a statement repeatedly. More study is needed.
All of this may have occurred to you, but I suspect this (or something like it) is how commentators come to the conclusions you've noted. Perhaps their comments should not be taken as a mathematical or scientific statement, but an observation based on occurrences that corresponds to what a statistical analysis would come up with if such an analysis were possible.
Pastor, North Park Baptist Church
Bridgeport, CT USA
0 -
Mark Smith said:
I suspect this (or something like it) is how commentators come to the conclusions you've noted. Perhaps their comments should not be taken as a mathematical or scientific statement, but an observation based on occurrences that corresponds to what a statistical analysis would come up with if such an analysis were possible.
Hi Mark,
I agree that this is likely how most commentators come to their conclusions. However, I think their conclusions could be more 'robust' if backed by statistical analysis.
Here is a sample of what statistical analysis of word frequency can yield: 5265.Lemma Frequency Analysis for Gospel of John (NA28).xlsx
There are doubtless better ways of doing a statistical analysis on word frequency, but I think this gives an idea of what I am suggesting. I think it would be useful if Logos could highlight the words with statistically significant frequency deviations...And it should be a relatively easy feature to code.
0 -
Thanks for the example. Let me just ask you:
- What words (lemmas) in John would you suggest be marked with significant frequency deviations?
- Would highlighting be sufficient or would there need to be a pop-up or link with the data used to reach this conclusion?
Perhaps some people might be alerted to a word or words that are statistically significant, but there is a difference between what is statistically significant and what is exegetically significant. That is where a commentator makes a judgment call based on other factors. I guess I don't see the value of this for my exegesis, but you must, so I'd be interested in how you think it could be profitably used for that.
0 -
Mark Smith said:
Thanks for the example. Let me just ask you:
- What words (lemmas) in John would you suggest be marked with significant frequency deviations?
- Would highlighting be sufficient or would there need to be a pop-up or link with the data used to reach this conclusion?
Perhaps some people might be alerted to a word or words that are statistically significant, but there is a difference between what is statistically significant and what is exegetically significant. That is where a commentator makes a judgment call based on other factors. I guess I don't see the value of this for my exegesis, but you must, so I'd be interested in how you think it could be profitably used for that.
Thanks Mark - Those are excellent questions.
- The words highlighted in green have a statistically significant deviation from the average NT usage (using Yates' corrected chi-squared) with an uncertainty of less than 1%, and those in orange have an uncertainty of less than 5%. These could then be highlighted as statistically significant deviations, since 5% seems to be a fairly standard cut-off in the literature.
- I would suggest that the comparison corpus (entire NT, same genre of writings, other period writings, etc) and maybe even the degree of uncertainty could be selected by the user if appropriate defaults are provided.
- I would never assert that statistically significant means exegetically significant, but in order to be exegetically significant, shouldn't it be statistically significant? I would suggest statistical analysis to curb unwarranted exegetical license. For example, how often do people talk about the significance of words that only appear in a particular book of the Bible? My analysis indicates that such a word must occur at least 3 times in John to be statistically significant (the cut-off would depend on the book's word count)
For what it's worth
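The idea of a minimum count for a word that appears only in one book can be sketched as follows. This is an illustration under stated assumptions, not the method used in the attached spreadsheet. The function names are hypothetical and the word totals in the example call are round placeholders, not figures from this thread. The result depends on the totals, the comparison corpus and the test variant, so it need not reproduce the figure of 3 quoted above.

```python
# Critical values of chi-squared with 1 degree of freedom.
CRITICAL_5_PERCENT = 3.841
CRITICAL_1_PERCENT = 6.635

def yates_chi2_exclusive(k, book_total, rest_total):
    """Yates-corrected chi-squared for a word found k times in the book
    and never in the rest of the corpus."""
    if book_total <= 0 or rest_total <= 0:
        raise ValueError("word totals must be positive")
    n = book_total + rest_total
    a, b, c, d = k, book_total - k, 0, rest_total
    # Continuity correction, clamped at zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def minimum_significant_count(book_total, rest_total, critical=CRITICAL_5_PERCENT):
    """Smallest k whose statistic exceeds the critical value, or None."""
    for k in range(1, book_total + 1):
        if yates_chi2_exclusive(k, book_total, rest_total) > critical:
            return k
    return None

# Placeholder totals, NOT taken from the thread; substitute real word counts.
print(minimum_significant_count(book_total=15_000, rest_total=120_000))
```

With counts this small the chi-squared approximation is weak, since the expected counts fall well below the usual rule of thumb of 5. An exact test such as Fisher's is often preferred in that situation.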
0 -
For anyone reading this thread (or the companion thread) who feels a bit overwhelmed: https://en.wikipedia.org/wiki/Corpus_linguistics will give you a brief introduction to the sorts of things being discussed here. Searching for "text analytics" as opposed to "corpus analytics" will take you into the realm of data mining of unstructured text.
0 -
Andrew Dreger said:
The words highlighted in green have a statistically significant deviation from the average NT usage (using Yates' corrected chi-squared) with an uncertainty of less than 1%, and those in orange have an uncertainty of less than 5%. These could then be highlighted as statistically significant deviations, since 5% seems to be a fairly standard cut-off in the literature.
That would seem to be quite a few annotations, but I understand why you would set it that way.
Andrew Dreger said:
I would suggest that the comparison corpus (entire NT, same genre of writings, other period writings, etc) and maybe even the degree of uncertainty could be selected by the user if appropriate defaults are provided.
If this were done, some controls would be needed. I'd suggest one might like control over the part of speech being marked.
If I understand this correctly, then the extent of the original text and what it is compared to should also be able to be set. For example I might want to know which verbs in 1 & 2 Corinthians are statistically significant compared to the rest of Paul's writings. Or which nouns in the Gospel of John are significant in comparison to the Synoptics. Here I might not be so interested in seeing highlighted words as in seeing lists.
I am still not sure how much use this would be, but then, we already have case frames and I can't see the use of them, so it is good I won't be making the final decision.
0