Some Background on Natural Language Processing – Part 1: English

I thought I would switch topics and start to talk about things like Large Language Models and how they could be applied to things like Geographic Information System (GIS) data.  To do this, I think first it would be good to talk about the basis for some of these tools, such as natural language processing (NLP).  

NLP is the basis for tools such as LLMs and even the ability to extract information from GIS data such as people, places, and things.  As such, I feel it is important to understand NLP as it provides a lot of value to GIS information as well as advanced processing of such information.

Some background about English

Before I dive into how NLP works, I first want to talk about the English language and how it relates to NLP.  This will give some background about why NLP is hard and how far we have come over the years.  So we will have a bit of a history lesson.  I love history and I finally have a reason to blog about it.  Next post we will dig into the background of how NLP used to work and how it works today.

What is the English Language?

First off, English is a horrible language.  It is a West Germanic language that is a part of the Indo-European language family.  It is a mashup of several different languages and continues to evolve almost yearly with new terms being added.  It got its start in what is now the UK prior to the fifth century when it was inhabited Celtic-speaking peoples.  Then came the Romans who tried to enforce Latin, but the natives said no for the most part.

Old English

Next came the Anglo-Saxons who came to Britain from the area around the northwest of modern Germany.  These Anglo-Saxon travelers brought their Ingvaeonic languages to Britain, displaced the Celts, and then started what we now call Old English.  Then while Old English was forming, we throw in some influence form the Vikings who invaded off and on and imparted some of their Norse words into the mix.  Then, to top things off, we had four known variants of the language that developed in Britain.

An example comes from Beowulf what was written around 1000 AD and here is the first line of it:

HWÆT:  WE GAR-DENA IN GEARDAGUM (So. The Spear-Danes in days gone by)

Middle English

Then came the Normans in 1066.  The Normans at the time spoke Anglo-Norman / Anglo-Norman French / Old French.  This became the language of the upper class and the courts of the time.  English then brought in some of the French words to add to the mix.  However, this did simplify the grammar of the language a bit.  This then went on to give us Middle English.  An example of this is found in The Canterbury Tales, where one of my high school English teachers made us memorize the first verse of it and scarred me for life.

Whan that Aprill with his shoures soote (When April with its sweet-smelling showers)

Early Modern English

Then we move on to Early Modern English that started to take root in the 15th to 17th centuries, better known as the Renaissance.  Here we decided to throw in some Latin, Greek, and even more French words for good measure.  The good news is that we started to simplify the grammar even more.  During this time, English went through the Great Vowel Shift.  This change modified how long vowels were pronounced and thus threw in some spelling changes.  The results of this shift have had wide-ranging impacts, including the basis for spelling and pronunciation mismatches in Modern English.  Plus the English and French began their love/hate relationship so it is possible that some anti-French sentiment caused some words to be modified to lose their influence, so to speak.  This was also the language used by Shakespeare who also added words to the language.  Here is an example of what the Lord’s Prayer would have looked like during this period.

Our father which art in heauen, hallowed be thy name. Thy kingdome come. Thy will be done, in earth, as it is in heauen.

Modern and Contemporary English

We finally arrive at Modern English which started in the 17th century and continues on today.  Thanks to the British Empire, English spread globally.  This global contact ended up adding more word to the language.  It went on to become the language of business, science, and diplomacy.

Then you may have heard of a bit of a spat between the British and a bunch of upstart colonies that formed the United States.  As time went on, American English began to differ a bit from British English, mainly in some spellings and different words for things (a car’s hood vs a car’s bonnet).  As the empire fell, we also got various dialects of English including Canadian and Australian English.

A Language that Confuses Native Speakers

This brings us to Contemporary English, a language with a large history of change and influences and words from various cultures.  Unlike Latin-based languages, it really does not follow many rules.  It tends to reuse words with different tenses and even different parts of speech.  The same word can have a lot of different and even non-related meanings.  Consider this, a perfectly valid sentence today:

I banked(1) on the advice from my friend and ended up making a lot of bank(2) by playing the bank(3) of slot machines at the casino.  To keep it safe, I decided to take it to the new bank(4) that they built on the south bank(5) of the river.

As a native American English speaker, this makes me want to cry.  This is why it is so hard for non-native English speakers to learn the language.  Let us look at the abomination that I just wrote.  The word bank is used as a:

  1. Verb: meaning to rely on
  2. Noun (slang really): meaning money
  3. Noun: meaning a row of similar items
  4. Noun: meaning a place to store things for later use
  5. Noun: sloping raised land near a river

Imagine you are someone trying to learn English and you come across that sentence.  Your head would explode.  Think about how hard it would be to figure out what was actually being said.  Note that I could have added another verb use by saying something like “On the way there, I had to bank to the left in my car to get in the proper lane.”  Thankfully I felt bad enough making the above sentence in the first place.

Now imagine a computer trying to parse this.  Remember that a computer is a glorified calculator.  Admittedly that sentence was made up to be a specific edge case, but it illustrates the point that if a human would have trouble understanding something, you can be sure that I computer will too.  And NLP has come a long way and these days would actually be able to make use of that sentence.

Distinguishing the meaning of the same word used in many different contexts has been, and continues to be, one of the primary reasons that getting NLP right has been so hard.  We have ended up with many historic and structural complexities that have led to the language today. 

Historically, most NLP research has been done on the English language due to its dominance in commerce and the link.  NLP into other languages is working to catch up, but for these blog posts I am going to consider that to be out of scope since, well, I do not speak those languages.

And to add in another complication, research and methods of NLP for the English languages do not exactly work well for other languages.  Each language has its own grammar and usage.  Some languages such as Spanish and French (Romance Languages) are similar at least.  But it is basically an apples to oranges comparison to try to apply English NLP principals to something like an Asian language.

Conclusion

I think I will leave things here for today.  Next time I will go into the history of how NLP worked in the past before moving on to how it works today.