Out of curiosity I decided to have a look at the text of the Bible and see how the occurrences of certain words and names are distributed among its books and chapters.

This is arguably the most basic type of language analysis one can try, the biggest limitation being that one must already be familiar with the text to some extent, or at least know quite well what they're looking for. Still, after a bit of exploration, some interesting patterns can be spotted. In this post I will describe what I tried to do, and present the most interesting results.

Source text

I chose to work with the Bible because it has several advantageous properties. First, there are multiple digitized English translations available in the public domain. Second, it has some chronological structure (it was written by many people over the ages, and the books are in more-or-less chronological order), which makes it possible to observe how the language and the concepts expressed in it evolved over time. Last but not least, I'm already familiar with it, so I had a decent understanding of the content, and also some expectations that I could verify.

After some research, I decided to go with the King James Bible (KJV), the traditional English translation, because (1) it's one of the most common versions, and (2) it is available in the most machine-readable format I could find. The KJV was translated at the beginning of the 17th century for the Church of England (which also means it is not an accepted version for e.g. the Catholic Church). It consists of 39 books of the Old Testament, 14 books of the Apocrypha, and 27 books of the New Testament.

Because the Apocrypha are by their nature somewhat separate from the canon, I included them at the end (although traditionally they are more strongly associated with the Old Testament). Keep in mind that the Apocrypha do not include what is often referred to as the New Testament Apocrypha – the out-of-canon gospels like the Gospel of Thomas, etc.

I wrote a simple parser for the text, to be able to analyze it by books, chapters or paragraphs.
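To give an idea of what such a parser might look like, here is a minimal sketch. It is not the code used for this post: it only assumes that verses are introduced by markers of the form {chapter:verse}, as in the excerpts quoted further down, and the function name parse_book is made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: verses start with markers such as "{14:19}",
# as seen in the excerpts quoted in this post.
VERSE_MARKER = re.compile(r"\{(\d+):(\d+)\}")


def parse_book(raw_text):
    """Split one book's raw text into {chapter: {verse: text}}."""
    chapters = defaultdict(dict)
    # re.split with two capture groups yields
    # [preamble, chapter, verse, text, chapter, verse, text, ...]
    parts = VERSE_MARKER.split(raw_text)
    for i in range(1, len(parts), 3):
        chapter, verse = int(parts[i]), int(parts[i + 1])
        # Collapse the hard line breaks of the source file into single spaces.
        chapters[chapter][verse] = " ".join(parts[i + 2].split())
    return dict(chapters)
```

Splitting a whole Bible file into books would need one more step that depends on how the book titles are marked in the particular file, so it is left out here.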

The parsed structure

Books 1-39 belong to the Old Testament (OT), 40-66 to the New Testament (NT), and 67-81 to the Apocrypha. Books 40-43 are the Gospels.

Word occurrences

I wrote a set of simple functions that make it easy to count and plot different words and word patterns across the Bible, at a chosen granularity (book/chapter/paragraph). Even with this basic tool one can find some interesting things in the text, and I'm going to present a couple of things that I tried.
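As an illustration, a counting function of this kind could be sketched as follows. This is a hypothetical helper, not the actual code behind the plots: count_terms and its arguments are names chosen for this example, and the whole-word, case-insensitive matching is an assumption.

```python
import re


def count_terms(units, terms):
    """Count whole-word, case-insensitive matches of any term in each unit.

    `units` maps a label (e.g. a book, chapter or paragraph id) to its text.
    """
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return {label: len(pattern.findall(text)) for label, text in units.items()}
```

The word boundaries keep a term like "lord" from matching inside "lordly". Case-insensitive matching is convenient for common words, but for proper names it would probably be safer to match case-sensitively.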

Names of God

In the Bible, God is referred to by multiple different names, some more common in the OT and others in the NT. One can also see that the name Christ is mentioned much more often in the later parts of the NT than in the Gospels.


Sin and violence

Sin is one of the major topics of the Bible. It is often related to violence, though other kinds of sins are often mentioned, too. Furthermore, violence, including warfare, is not always considered a sin – especially in the OT – and certain kinds of violence are committed by God himself. Moreover, words like "fight" or "battle" can of course refer to things that are not necessarily violent, e.g. spiritual or moral struggle.



We can easily see that there are some points with a striking accumulation of vocabulary related to sin and violence. In general, the OT contains much more of it, but the Gospels (40-43) also stand out, especially when it comes to sins. There are some books which score high in both categories, e.g. 19, 23, 24, 26 – the books of Psalms, Isaiah, Jeremiah, and Ezekiel.

For example, The Book of Isaiah includes prophecies about the fate of Israel, about God punishing the wicked, and helping the faithful to defeat their enemies:

{14:19} But thou art cast out of thy grave like an abominable branch, [and as]
the raiment of those that are slain, thrust through with a sword, that
go down to the stones of the pit; as a carcase trodden under feet.
{14:20} Thou shalt not be joined with them in burial, because thou hast
destroyed thy land, [and] slain thy people: the seed of evildoers shall
never be renowned. {14:21} Prepare slaughter for his children for the
iniquity of their fathers; that they do not rise, nor possess the land,
nor fill the face of the world with cities. {14:22} For I will rise up
against them, saith the LORD of hosts, and cut off from Babylon the
name, and remnant, and son, and nephew, saith the LORD.

As a side note, The Book of Ezekiel is the one famously (mis)quoted by Jules Winnfield in Pulp Fiction:

{25:17} And I will execute great vengeance upon them with furious
rebukes; and they shall know that I [am] the LORD, when I shall lay my
vengeance upon them.

However, just looking at the number of occurrences of certain words might lead us to hasty conclusions. The books of the Bible differ quite a lot in length, and that should be taken into account, too. For example, the NT contains many books which are very short, especially compared to the early books of the OT.


If we look at the relative frequency of the terms, instead of their counts, the picture is a bit different:



The patterns we can spot now are different. The overall picture is less obvious, and more difficult to interpret. We can still see, for instance, that the early OT contains a lot of violence, while being relatively less concerned with different kinds of sins. Later on, the landscape becomes more diversified, which may point to a growing interest in various moral problems.

The books which I previously mentioned as standing out for their large amount of vocabulary related to sin and violence (19, 23, 24, 26) also turn out to be very long. After taking the total length into account, they still score very high, but there are some new peaks visible now – like The Book of Lamentations (25).
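The length normalization discussed above can be sketched roughly like this. It is only an illustration under assumptions: the helper name relative_frequency, the simple regular expression used to count words, and the "per 1000 words" scale are all choices made for this example, and the plots may have been normalized differently.

```python
import re

# Assumption: a "word" is a run of letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")


def relative_frequency(counts, texts, per=1000):
    """Turn raw counts into occurrences per `per` words of each unit.

    `counts` maps a label to a raw term count,
    `texts` maps the same labels to the full text of that unit.
    """
    freqs = {}
    for label, count in counts.items():
        total_words = len(WORD.findall(texts[label]))
        # Guard against empty units to avoid dividing by zero.
        freqs[label] = per * count / total_words if total_words else 0.0
    return freqs
```

With this kind of scaling, a short book with five matches in a hundred words ranks above a long book with ten matches in a thousand words, which is exactly the effect that makes short books like Lamentations appear as new peaks.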

Men and women

It is a well-known fact that in the early days of humanity, men mated with each other and gave birth to children. Women giving birth, or generally existing, is a slightly later idea.

I tried to see how often women are mentioned in the Bible, as compared to men, and how the situation evolved over time. Unsurprisingly, women have a significantly lower presence in the book.


Interestingly, the number of mentions seems to decrease over time – in the older books, the vocabulary related to women is much more frequent, especially as compared to the men-related vocabulary. In the later parts of the NT (after the Gospels), it is virtually non-existent.


Again, looking at relative frequencies undermines the previous hypothesis of a decreasing trend, or at least makes the trend much less visible. The difference between male- and female-related vocabulary, however, is still very obvious – with two notable exceptions. Those are The Book of Ruth, which tells the story of Ruth, the great-grandmother of David; and The Book of Susanna – treated as apocryphal in the KJV, but included in the canonical Book of Daniel by the Catholics – which tells the story of Susanna, who was harassed, blackmailed, and falsely accused by two lustful elders, and eventually saved by Daniel.

As can be seen in the graphs, I only searched for general gender-related words. It would also be interesting to have a look at the proper names. Unfortunately, I couldn't find any complete list of characters from the Bible, especially separate ones for men and women. There is, however, a list of female characters on Wikipedia. While I was not able to find a comparable list of male characters, looking only at female characters can also be illuminating to some extent.


Notably, the largest number of occurrences by far can be found in the book of Genesis, which mentions women such as Sarah, Rebekah, Rachel or Leah.

I suppose the Wikipedia list is not exhaustive. Moreover, some of the names are actually ambiguous: for instance, there are five biblical characters named Abihail, three of them male and two of them female. I also had to remove some names that gave a lot of false positives (e.g. Lo, Me, and Noah – the daughter of Zelophehad, being confused with the better-known Noah who was a man). That being said, this example is certainly an outlier, and I think the graph is mostly reliable.

In order to make a better comparison, it would be necessary to manually prepare a list of names, divided by gender. Given that many different lists of characters are available on the internet, it probably wouldn't be as tedious a task as it might seem, and perhaps I will get back to this topic at some point if I have more time.

Sentiment analysis

This is not exactly the kind of task that sentiment analysis was invented for. Nevertheless, I thought it might be interesting to run some simple SA on the Bible and see what we can get.

For this task, I used the most common open-source SA tool included in NLTK. Again, this tool was created to analyze modern English as used on the Internet, and certainly not biblical language (especially in an old translation). However, despite considerable differences in grammar, a large chunk of the vocabulary is actually not that different, so I expected to get somewhat meaningful results. I decided to check the emotional valence associated with different keywords. High valence means more positive connotations, while low valence means negative connotations.

I prepared the text by tokenizing and lemmatizing it, removing the punctuation and common stopwords. Then, for each book and a given keyword, I extracted only the sentences mentioning the keyword, and calculated their mean compound sentiment polarity.
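The per-keyword averaging step could be sketched as below. This is a hedged illustration rather than the original code: the function names are invented for this example, and the scorer is passed in as a parameter so that any sentence-level scoring function can be used. The helper vader_compound_scorer assumes that the NLTK tool in question is VADER (nltk.sentiment.vader), which is a guess based on the "compound" score mentioned above; it needs NLTK and its vader_lexicon resource to be installed.

```python
import re


def mean_keyword_sentiment(sentences, keyword, score):
    """Mean score of the sentences that mention `keyword`.

    `score` is any function mapping a sentence to a number.
    Returns 0.0 when no sentence mentions the keyword.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    scores = [score(sentence) for sentence in sentences if pattern.search(sentence)]
    return sum(scores) / len(scores) if scores else 0.0


def vader_compound_scorer():
    """Build a scorer from NLTK's VADER analyzer (assumed tool, see above)."""
    # Imported lazily so the averaging function works without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda sentence: analyzer.polarity_scores(sentence)["compound"]
```

Returning 0.0 when a keyword is not mentioned is a simplification made here. It means that a zero in the output can stand either for neutral sentiment or for a missing keyword.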

Below you can see the average sentiment associated with different names of God across the Bible.


Understandably, Jesus Christ was not mentioned in the Old Testament, thus the sentiment score is 0 before the New Testament starts. Most of the other zeros are also the result of the lack of mentions.

The two lowest valleys in the plot are The Book of Lamentations (25), telling about the destruction of Jerusalem by Babylon, and The Book of Nahum (34), about the conquest of Nineveh by the Medes and Babylonians. They contain a lot of violence, and portray God as a jealous and vengeful entity. In contrast, God, under all names, is consistently talked about in a relatively positive context in the later parts of the New Testament (not so much in the Gospels; 40-43), and less consistently in the Apocrypha.

Other ideas

As I mentioned at the beginning, the analysis presented in this post is pretty basic, and a lot more could be done to make it more insightful. Perhaps the most challenging part is the very specific language of the Bible. Building a language model based (at least partly) on this particular text would make it possible to try, for instance, some semantic clustering. With that, we could go a bit beyond the approach of picking arbitrary keywords, which would definitely make the analysis more interesting, and could help us reveal a bit more about the relationships between different words in the text.

I'm planning to get back to this topic at some point, and also try to explore some other sacred texts, depending on their availability.

All the code that I used for this analysis – and some other things, too – is publicly available on my GitLab.