Some Background on Natural Language Processing – Part 1: English

I thought I would switch topics and start to talk about things like Large Language Models and how they can be applied to Geographic Information System (GIS) data.  To do this, I think it would be good to first talk about the basis for some of these tools: natural language processing (NLP).

NLP is the basis for tools such as LLMs, and even for the ability to extract information such as people, places, and things from GIS data.  As such, I feel it is important to understand NLP, as it provides a lot of value to GIS information and enables advanced processing of it.

Some background about English

Before I dive into how NLP works, I first want to talk about the English language and how it relates to NLP.  This will give some background about why NLP is hard and how far we have come over the years.  So we will have a bit of a history lesson.  I love history and I finally have a reason to blog about it.  Next post we will dig into the background of how NLP used to work and how it works today.

What is the English Language?

First off, English is a horrible language.  It is a West Germanic language that is a part of the Indo-European language family.  It is a mashup of several different languages and continues to evolve almost yearly as new terms are added.  It got its start in what is now the UK prior to the fifth century, when the area was inhabited by Celtic-speaking peoples.  Then came the Romans, who tried to enforce Latin, but the natives said no for the most part.

Old English

Next came the Anglo-Saxons, who came to Britain from the area around the northwest of modern Germany.  These Anglo-Saxon travelers brought their Ingvaeonic languages to Britain, displaced the Celts, and then started what we now call Old English.  Then, while Old English was forming, we throw in some influence from the Vikings, who invaded off and on and imparted some of their Norse words into the mix.  Then, to top things off, four known dialects of the language developed in Britain.

An example comes from Beowulf, which was written around 1000 AD; here is its first line:

HWÆT:  WE GAR-DENA IN GEARDAGUM (So. The Spear-Danes in days gone by)

Middle English

Then came the Normans in 1066.  The Normans at the time spoke Anglo-Norman (also called Anglo-Norman French or Old French), which became the language of the upper class and the courts of the time.  English then brought some of the French words into the mix, and during this period the grammar of the language simplified a bit.  This went on to give us Middle English.  An example of this is found in The Canterbury Tales, whose first verse one of my high school English teachers made us memorize, scarring me for life.

Whan that Aprill with his shoures soote (When April with its sweet-smelling showers)

Early Modern English

Then we move on to Early Modern English, which started to take root in the 15th to 17th centuries, better known as the Renaissance.  Here we decided to throw in some Latin, Greek, and even more French words for good measure.  The good news is that the grammar simplified even further.  During this time, English went through the Great Vowel Shift.  This change modified how long vowels were pronounced while spellings largely stayed put, and the results of this shift have had wide-ranging impacts, including being the basis for the spelling and pronunciation mismatches in Modern English.  Plus, the English and the French began their love/hate relationship, so it is possible that some anti-French sentiment caused some words to be modified to hide their French influence, so to speak.  This was also the language used by Shakespeare, who added his own words to it.  Here is an example of what the Lord’s Prayer would have looked like during this period.

Our father which art in heauen, hallowed be thy name. Thy kingdome come. Thy will be done, in earth, as it is in heauen.

Modern and Contemporary English

We finally arrive at Modern English, which started in the 17th century and continues today.  Thanks to the British Empire, English spread globally, and this global contact ended up adding even more words to the language.  It went on to become the language of business, science, and diplomacy.

Then you may have heard of a bit of a spat between the British and a bunch of upstart colonies that formed the United States.  As time went on, American English began to differ a bit from British English, mainly in some spellings and different words for things (a car’s hood vs a car’s bonnet).  As the empire fell, we also got various dialects of English including Canadian and Australian English.

A Language that Confuses Native Speakers

This brings us to Contemporary English, a language with a long history of change and influence, full of words from various cultures.  Unlike Latin-based languages, it really does not follow many consistent rules.  It tends to reuse words across different tenses and even different parts of speech, and the same word can have many different, even unrelated, meanings.  Consider this perfectly valid sentence:

I banked(1) on the advice from my friend and ended up making a lot of bank(2) by playing the bank(3) of slot machines at the casino.  To keep it safe, I decided to take it to the new bank(4) that they built on the south bank(5) of the river.

As a native speaker of American English, this makes me want to cry.  This is why it is so hard for non-native speakers to learn the language.  Let us look at the abomination that I just wrote.  The word bank is used as a:

  1. Verb: meaning to rely on
  2. Noun (slang really): meaning money
  3. Noun: meaning a row of similar items
  4. Noun: meaning a place to store things for later use
  5. Noun: meaning sloping raised land near a river

Imagine you are someone trying to learn English and you come across that sentence.  Your head would explode.  Think about how hard it would be to figure out what was actually being said.  Note that I could have added another verb use with something like “On the way there, I had to bank to the left in my car to get into the proper lane.”  Thankfully, I felt bad enough about the above sentence in the first place.

Now imagine a computer trying to parse this.  Remember that a computer is a glorified calculator.  Admittedly, that sentence was made up as a specific edge case, but it illustrates the point: if a human would have trouble understanding something, you can be sure that a computer will too.  That said, NLP has come a long way, and these days it actually could make use of that sentence.
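To make that concrete, here is a minimal sketch of how a toolkit starts to untangle the sentence, using the NLTK library (the data download names can vary a bit between NLTK versions).  Part-of-speech tagging separates the verb use from the noun uses, and the classic Lesk algorithm takes a guess at which dictionary sense of “bank” is meant:

    import nltk
    from nltk.wsd import lesk

    # One-time data downloads (names may differ slightly across NLTK versions).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")
    nltk.download("wordnet")

    sentence = ("I banked on the advice from my friend and ended up making "
                "a lot of bank by playing the bank of slot machines at the casino.")
    tokens = nltk.word_tokenize(sentence)

    # Part-of-speech tagging: "banked" comes back as a verb (VBD),
    # while the other uses of "bank" come back as nouns (NN).
    print(nltk.pos_tag(tokens))

    # Word-sense disambiguation: Lesk picks a WordNet sense for "bank"
    # (as a noun) based on the words around it.
    print(lesk(tokens, "bank", "n"))

This only scratches the surface, but it shows the two classic steps: figure out the part of speech, then figure out which sense of the word is in play.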

Distinguishing the meaning of the same word used in many different contexts has been, and continues to be, one of the primary reasons that getting NLP right is so hard.  Centuries of historical and structural complexity have led to the language we have today.

Historically, most NLP research has been done on the English language due to its dominance in commerce and the like.  NLP for other languages is catching up, but for these blog posts I am going to consider them out of scope since, well, I do not speak those languages.

And to add another complication, NLP research and methods for the English language do not exactly work well for other languages.  Each language has its own grammar and usage.  Some languages, such as Spanish and French (Romance languages), are at least similar to each other.  But it is basically an apples-to-oranges comparison to apply English NLP principles to something like an Asian language.

Conclusion

I think I will leave things here for today.  Next time I will go into the history of how NLP worked in the past before moving on to how it works today.

Stupid LIDAR Tricks Finale

I thought I’d finally wrap this up so I can move on to other things.  Since I last posted, I replaced my Jankinator 1000 (an nVidia Tesla P40 with a water cooler) and my nVidia RTX 2060 with an Intel Arc A770.  It has 16 GB of VRAM and is actually a pretty fast GPU on my Linux box.

So far I’ve had pretty good luck getting things like TensorFlow and PyTorch working on it, as mentioned in a previous blog post.  The only thing so far that I haven’t gotten to work 100% is Facebook’s Segment Anything Model 2 (SAM2), which of course is what I said in the last post I wanted to try with the LIDAR GeoTIFF.  It basically comes down to running out of VRAM, although I’m not sure exactly why, since I’ve tweaked settings for OpenCL memory allocation on Linux, etc.  I’ve finally given up on that one for now and decided to just use the CPU for processing.

As a refresher, here is the LIDAR GeoTIFF I have been using.

[Image: the LIDAR GeoTIFF used for these experiments]

SAM2 has the ability to automatically generate masks on input images.  This was of interest to me since I wanted to test whether it could automatically identify areas of interest in LIDAR.  Fortunately, the GitHub repo for SAM2 has a Jupyter notebook that made it easy to run some experiments.
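For the curious, the core of the experiment looks roughly like the following.  This is a minimal sketch patterned after the repo’s automatic mask generator notebook; the config, checkpoint, and image paths are placeholders for whatever SAM2 model you downloaded:

    import numpy as np
    from PIL import Image
    from sam2.build_sam import build_sam2
    from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

    # Placeholder config/checkpoint names; use the SAM2 model you downloaded.
    model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml",
                       "checkpoints/sam2.1_hiera_large.pt",
                       device="cpu")  # CPU, since my GPU kept running out of VRAM

    mask_generator = SAM2AutomaticMaskGenerator(model)

    # SAM2 expects an RGB image as a NumPy array.
    image = np.array(Image.open("lidar_geotiff.png").convert("RGB"))

    masks = mask_generator.generate(image)
    print(f"Generated {len(masks)} masks")

Each entry in masks is a dictionary containing the segmentation itself plus quality scores (predicted IoU, stability) that you can filter on.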

[Image: SAM2 automatic masks with the default parameters]

With all of the default parameters, we can see that it only identified the lower right corner of the image.  I tweaked a few of the settings (more on the specific knobs after the next image) and came up with this version:

[Image: SAM2 automatic masks after tweaking the parameters]

The second image does show some areas highlighted.  It got a grouping of row houses at the bottom of the image and also picked up a few single-family houses.  However, it also flagged areas where there is really nothing of interest.
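For reference, the settings I was tweaking are constructor arguments on the mask generator (reusing model and image from the earlier sketch).  The values below are illustrative, not the exact ones I used:

    # Denser point sampling and looser quality thresholds tend to surface
    # more (and smaller) candidate regions, at the cost of more noise.
    mask_generator = SAM2AutomaticMaskGenerator(
        model,
        points_per_side=64,           # denser sampling grid
        pred_iou_thresh=0.7,          # keep lower-confidence masks
        stability_score_thresh=0.85,  # likewise for the stability filter
        min_mask_region_area=100,     # drop tiny speckle regions
    )
    masks = mask_generator.generate(image)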

Finale

What have we learned from all of this?  Well, LIDAR is hard.  Automatically finding features of interest in LIDAR is also hard.  We can get some decent results using image processing and/or deep learning techniques, but as with anything in the field, we are nowhere near 100%.

Previously I posted about training a custom RCNN to identify features of interest from GeoTIFF LIDAR.  It was somewhat successful, although, as I said, it needed a lot more training data than I had available.

I do think that over time, people will develop models that do a good job of finding certain areas of interest in LIDAR.  Some disciplines, such as archaeology, already have software to find specific features in LIDAR, and this is probably how things will continue for a while.  We probably will not have a generalized “find everything of interest in this LIDAR image” model or software for a long time.  However, it is possible to train a model to identify specific kinds of areas.

I am also working with LIDAR point data, as opposed to the GeoTIFF versions.  Point data is a different beast in that each point carries a classification after processing.  You can then do things such as extract tree-canopy points or bare-ground points.  Converting to a raster necessarily loses information, since things have to be interpolated and reduced in order to produce the raster.  I’ll post more here in the future, because while point cloud data can be harder to work with, I think the results are better than what can be obtained from rasters.
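As a preview, filtering a classified point cloud is pretty painless with the laspy library.  Here is a minimal sketch, with a placeholder file name, using the standard ASPRS classification codes:

    import laspy

    # Reading .laz files needs one of laspy's compression backends
    # (e.g. the lazrs or laszip extras).
    las = laspy.read("area_of_interest.laz")  # placeholder file name

    # ASPRS standard classification codes: 2 = ground, 5 = high vegetation.
    ground = las.points[las.classification == 2]
    canopy = las.points[las.classification == 5]

    print(f"{len(ground)} ground points, {len(canopy)} canopy points")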


Saying Goodbye to my Census Tiger Data

For a long time now I’ve maintained a version of the public domain Census TIGER data converted from county-level to state-level.  Over the years, a lot of those shapefiles have been downloaded, so I’m glad they were useful to some people!

However, the Census is now putting out geopackages of their data at both the national and state levels.  I’d also like to thank the US Census for doing that, as I think state-level data is way more usable than county-level!

As a result, I think there’s not really a reason for me to host the state-level data anymore.  I’ll still host the script to download the data and create a PostgreSQL/PostGIS database at my GitHub repo, and I might even get around to adding scripts to automatically process the geopackages.
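If I do get around to it, the processing would likely be something simple along these lines.  This is a hypothetical sketch with placeholder names, using GeoPandas to push a downloaded geopackage into PostGIS (it needs the geoalchemy2 package installed):

    import geopandas as gpd
    from sqlalchemy import create_engine

    # Placeholder connection string and file name.
    engine = create_engine("postgresql://user:password@localhost/census")

    gdf = gpd.read_file("tl_2024_us_state.gpkg")  # a downloaded Census geopackage
    gdf.to_postgis("state", engine, if_exists="replace")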

So, so long my state-level Census data and thanks for all the fish!