A Short Guide to NLP – Part 3: How We Do NLP Today

Posted on April 27, 2025 by bigbubba

OK I will admit it took me longer than I had planned to finish this up. Life got in the way. But now I think is a good time to finish up the series and move on to another.

In the first parts of this series, we looked at why English is such a challenging language for natural language processing (NLP), and how early methods like rules-based systems and Bag of Words models approached the problem. But like everything else, access to power GPU’s and AI has had a big impact on modern NLP techniques.

Today, NLP has evolved dramatically, powered by new methods that are far more powerful — and much better at handling the complexity and ambiguity of human language.

Let us look at some of the modern techniques that have reshaped NLP in recent years.

Word Embeddings: Giving Words Meaning

One major leap forward came with word embeddings — ways of representing words as vectors in a multi-dimensional space, where words with similar meanings are close together.

Unlike older methods that treated words as isolated tokens, embeddings like Word2Vec, GloVe, and FastText actually learned that words like “king” and “queen” are related, or that “Paris” and “France” have a strong connection.

These embeddings helped NLP systems recognize relationships and similarities between words — even when they were not identical — which made downstream tasks like translation, search, and classification much more accurate.

Let us dig a little deeper into how these things work by taking a look at Word2Vec. Word2Vec generates a vector representation of words in a multi-dimensional space. In this space, similar words end up being “closer” to one another. The idea behind it is that words that appear in similar contexts tend to have similar meanings.

Word2Vec actually generates a shallow neural network to learn the relationships between words. It typically has a dense hidden layer that takes generated embeddings from the inputs to predict the outputs. It does this in two different ways.

The first way is the Skip-Gram Model and is the most common method of generating the network. We select a word in the middle of the sentence and train the model to predict the nearby words within a certain window size. These word pairs (such as cat and furry) are fed into the network for training. The model learns to predict context words given an input word. A very simplified example of the network is given below.

Input Layer (one-hot encoding) : (cat, furry)
          |
Hidden Layer (dense layer)
          |
Output Layer (Softmax layer that outputs probabilities for words being a nearby word)

The second method is the Continuous Bag of Words model. Here the model is trained by giving it context words and it then tries to predict the target word. It can be thought of as a reverse of the previous technique. It uses a fixed-sized window of context words around the target word. So if our target word in a sentence is cat, we could have context words of furry and purrs. Another of my masterful graphic arts shows how this works below.

Input Layer (words are one-hot encoded, averaged into an input vector, and the vectors summed depending on the implementation)
          |
Hidden Layer (dense layer)
          | 

Output Layer (Softmax layer that outputs probabilities of a single word being the target word)

But traditional embeddings had one big weakness: each word only had one vector, no matter how it was used. “Bank” always meant the same thing, whether you were talking about rivers or money.

That led to the next big innovation: contextual embeddings.

Contextual Embeddings: Understanding Words in Context

With models like ELMo and BERT, NLP systems moved beyond static word meanings. Now, the context of a word — the words around it — could change its meaning. If you wrote “he deposited cash at the bank” versus “they sat by the river bank,” modern models could understand that bank means two very different things.

This made a huge difference in tasks like question answering, translation, and search, where understanding the nuance of a sentence is critical.

Transformers: The Engine Behind Modern NLP

All of this was made possible by a groundbreaking architecture introduced in 2017: the Transformer.

Transformers, introduced in the paper Attention is All You Need, replaced older models like RNNs and LSTMs by doing something surprisingly simple: instead of reading words one by one, they looked at the entire sentence (or even paragraph) all at once.

At the heart of transformers is the attention mechanism, which lets the model figure out which words in a sentence are most important when trying to understand a given word.

This means a model can understand relationships across an entire sentence — or even multiple sentences — no matter how far apart the words are.

For an example, we will look at how BERT (Bi-directional Encoder Representations from Transformers) works. BERT is an encoder only transformer architecture that consists of four main modules:

The Tokenizer Module converts the words in a sentence into a series of tokens.
The Embeddings Module converts the tokens into embeddings.
The Encoder Module is a stack of Transformer blocks that use self-attention without causal masking. Self-attention here basically means the blocks determine the relative importance of component in a sequence relative to the other components of the sentence. This lets the model learn the relationships between words no matter where they appear in a sentence.
The Task-Head Module uses the final embeddings to predict outputs, such as masked words or next sentence predictions, typically through a classification layer.

BERT processes the entire sentence at once, verses other methods that look at text sequentially. As a whole model, BERT is trained and fine-tuned using two unsupervised-learning tasks.

Masked Language Modeling (MLM) masks random words in a sentence and the model learns how to predict the masked words based on the context provided by the other words in the sentence.
Next Sentence Prediction (NSP) works by giving the model pairs of sentences and it learns to predict whether or not the second sentence logically follows the first.

Pre-trained Language Models: A Giant Leap

Once transformers became popular, researchers realized they could pre-train huge language models on massive amounts of text — and then fine-tune them for specific tasks.

Instead of training a model from scratch every time you wanted to do translation, or summarization, or classification, you could just start with a giant model that already knew a lot about language, and tweak it slightly.

Some of the most important pre-trained models today include:

BERT: A transformer-based model that reads text bidirectionally, learning to predict missing words.
GPT (Generative Pre-trained Transformer): A model that learns by predicting the next word in a sequence, leading to natural text generation (and later, chatbots).
T5 (Text-to-Text Transfer Transformer): A model that frames everything as a text generation task, from translation to summarization.

These models have powered major advances in search engines, customer support chatbots, translation apps, and even tools like ChatGPT. Let us look at GPTs as an example of these models.

GPTs are trained by a huge input corpus such as text books, Wikipedia, blogs, websites, and other sources. With the word Transformers in the sentence, we know that they will use Transformers to process the training data. As the input data is fed into the model, it uses attention and self-attention to consider the context of the word in a sentence based on all of the other words. This self-supervised learning approach lets the model learn from vast amounts of unlabeled text, though the training data is often filtered for quality.

When you give a GPT a prompt, it will first break the sentences you type in down and generate embeddings of the sentences. It passes these embeddings through the model architecture to understand the relationships between the words for their meanings.

The Generative part creates output text by computing probable sentences and their order to generate a coherent output. The key here is that the output is entirely based on probabilities. As your input text goes through the model, it predicts output that it thinks matches your input by generating information that is inside the model that is “close” to the meaning of your input prompt. It may seem like the model understands what you have said, but in the end everything is based on probability. This is also why GPTs can hallucinate, where the model can incorrectly predict output that makes no sense or has incorrect information.

Challenges Still Remain

While modern NLP has come a long way, it’s far from perfect. Some of the challenges include:

Bias in training data: Large language models can reflect the biases of the data they were trained on.
Understanding rare or low-resource languages: Most models are still strongest in English and other major languages.
Cost and energy usage: Training massive models requires enormous amounts of computing power.

Researchers are actively working on these issues, but they show that even with today’s powerful tools, the complexity of human language is still a hard problem.

Wrapping Up

From handcrafted rules to massive transformers, NLP has evolved faster in the last five years than almost any other field in AI.

While English — with its ambiguity, irregular grammar, and endless exceptions — remains a tough language for machines to master, the combination of contextual understanding, transformers, and pre-trained models has made it possible to do things that seemed like science fiction just a decade ago.

The future of NLP is even more exciting, with research moving toward multilingual models, more efficient architectures, and even models that can understand images, sounds, and language together.

Thanks for joining me on (and waiting on me to finish) this quick tour through the world of NLP. Until next time.

Some Background on Natural Language Processing – Part 2: Older Methods and Techniques

Posted on December 16, 2024 by bigbubba

Last time we learned why English is a hard language, both for humans and especially for computers. For this time I think we will look at the past of NLP to understand the present. It actually has been an interesting history to get to where we are now.

Historic NLP methods relied on approaches that focused more on linguistics and statistical analysis. Computers were not as powerful as they are now, and definitely did not have GPUs with the crazy amount of processing capability that they have now. They can be broadly grouped into six categories: rule-based systems, bag of words, term frequency-inverse document frequency, n-grams, Hidden Markov Models, and Support Vector Machines. Let us explore these methods to see how they worked and led to today’s more powerful systems. This post will have a lot of links to other sources of information as it is intended to be more of a gentle introduction than an exhaustive analysis.

Rule-Based Systems

Rule-Based Systems are perhaps the oldest method of trying to get a computer to be able to process a language. These systems were built on predefined linguistic rules and handcrafted grammars. Grammar in this case is not about usage, such as matching nouns and verbs and what not. In this case the grammar is known as a formal grammar that is a mathematical structure of a sentence. One specific method is a context-free grammar that has a set of rules of how to form sentences from smaller words and phrases.

Rule-based systems explicitly encode language in a series of rules so that text can be parsed. This parsing would then be able to identify parts of speech and identify sentence structure. These rules were typically created by linguists and software developers working together to try to codify language as a series of mathematical rules.

The rules they would create were simple so that they could be combined together for more complex sentences. For example, one rule could be “any word that ends in ing is likely a verb”, while another could be “a noun follows determiner words such as a, then, and an.” The developer would then take these rules to create a parser to take in sentences and break them down into their syntactic components using the mathematically-defined grammars.

Since I am a cat person, let us look at this very simple example sentence: “My cat is playing with a toy.” In this case, the would be tagged as a determiner words, so cat would be tagged as a noun. Is playing would be identified as a verb. A is another determiner word, so toy would be tagged as another noun.

While this sounds simple, these systems were complex and labor intensive. They required the work of expert linguists and developers to build the parsing systems. They required a lot of maintenance as problems were found and the rules needed to be modified. As they were based on formal mathematics, they were not very flexible when it comes to something sloppy like English. This meant that rules after rules would have to be developed and stacked on top of one another.

Bag of Words

The Bag of Words (BoW) method was an early attempt at applying statistical processing to NLP. It sought to represent text in a numerical form so that statistical algorithms could be applied. These algorithms could then do things like classify documents or determine the similarity of two different documents.

The BoW method works by treating the input text as an unordered collection of words. In this case, however, neither the order of words nor the sentence grammar are used. The document is tokenized, where the words are converted into tokens. A list of each unique token is then created. The input text is analyzed and each word is counted to determine the frequency of occurrence in the document. The document is then converted into a vector of word frequencies that is the same size as the number of unique words in the document.

You may know vectors from geometry where they represent lines and their direction. It is still similar in this case where it is a mathematical array of numbers. In the case of a frequency representation, the array could look something like [10, 5, 3, 9] and so on, where each number represents the frequency a word appears.

For computing document similarity, both documents would be turned into frequency vectors by enumerating their unique words and then creating a vector that is the same length as the total number of unique words. The frequencies of each word are inserted into the array, with a zero placed in locations where a word occurs in one document but not in the other. The order of the word frequencies of the array can be sorted so each array entry corresponds to the same one in the other document.

The comparison is easier to think of back in terms of geometry. There are two methods to compute the distance between the vectors: cosine similarity and Euclidean Distance. For cosine similarity, the cosine of the angle between the two vectors in multidimensional space is calculated with the result being between -1 and 1. A value of 1 means the vectors are identical, 0 means they are orthogonal to each other (not similar), and a -1 means they are very different from each other, and other values are between one of the three.

Euclidean Distance calculates the straight-line distance between each point in the vectors in Euclidean space (told you geometry would come into it). The distance metric is actually based on the Pythagorean Theorem that you may remember from high school. The output of this is a vector that is the same length as the document vectors and each entry is the distance between the corresponding points. The smaller the distances, the more similar the documents are.

While BoW is good for things like document similarity, it has several drawbacks when compared to other NLP techniques. For one, since it just deals with statistical frequencies, it ignores things such as order of words and their context. This means there is no real semantic information that is carried over. The other drawback is that for large documents that are very different, the generated vectors can be largely empty and become cumbersome for distance computation algorithms.

Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) came about as an improvement on the BoW concept by weighing words based on importance in a document relative to a larger group of documents. It attempts to identify significant words in a document while ignoring common words that frequently occur, such as the, and, but, and so on.

To understand how it works, let us break down the various parts of how it works. The first part is the Term Frequency (TF). As it sounds, this is a measure of how frequently a word appears in a document. Thus words that occur more will have a higher score.

The next part of the calculation is the Inverse Document Frequency (IDF). This equation is a measure of how often a word occurs across the entire set of documents. Words that do not appear often are assigned a higher score.

Now we multiply the two values together to come up with the TF-IDF value.

This value reflects how “important” a word is in a document relative to the total number of documents. Word that occur often in one document but not in the total number of documents would end up with a high TF-IDF score. This can be use to filter out the common words I mentioned above.

TF-IDF also has several drawbacks when it comes to NLP. For one, it still ignores word order and the context in which words are used. As previously mentioned, word order and context are very important when attempting to determine the actual meaning of a word and how it is used.

This leads into the next issue with TF-IDF. Without context, it fails to account for words that mean the same thing. Consider something like dog and puppy. TF-IDF would treat them as different words instead of recognizing they are similar.

N-Grams

N-Grams were created to try to address the limitations of lack of word context. They consider sequences of words to preserve word order versus looking solely at single words. This consideration made N-Grams popular for modeling textual language and actual text generation.

An N-Gram is defined as a contiguous sequence of N items that are in text. Consider the example sentence of “The cat sleeps on the chair”. A unigram (1-Gram) would be a single word such as “The” and “cat”. A bigram (2-Gram) is a sequence of two words, such as “The cat” and “cat sleeps”. A trigram (3-Gram) is then a sequence of three words, such as “The cat sleeps” or “cat sleeps on”. This continues on for larger numbers of N and counts the frequency that these sequences occur in a text. As such, the higher the value of N, the more context is captured from the text.

You might immediately see a problem with N-Grams. As the value of N increases, the number of possible word sequences goes up nearly exponentially. For example, a vocabulary of V would have approximately V^N N-Grams. This makes the model much harder to train when there is a limited amount of text. Higher values of N also can cause a smaller number of matches across documents. N-Grams still fail to capture dependencies across sentences as they only capture localized context.

Hidden Markov Models

Hidden Markov Models (HMMs) were used quite a bit (and periodically used today) for NLP tasks like part-of-speech tagging and named entity recognition (NER). They are probability models that represent the sequence of hidden states that exist in systems that are based on observable events.

A HMM consists of several parts:

Hidden States represent the phenomena of a system that are unobservable. In relation to NLP, these hidden states could be parts of speech.
Observations are the words or their tokenization from text that are observed directly.
Transition probabilities are the probability of moving from one hidden state to another. In NLP, nouns are usually followed by verbs, so that the probability of going from a noun to a verb is very high.
Emission probabilities are the probabilities of observing a particular word given a particular hidden state. For example, if the hidden state is a verb, then the probability of the word sleep given the verb state would be high.

HMMs work by assuming that the system being modeled (not necessarily just for NLP work) is a Markov Process. Put simply, a Markov Process means that the probability of each state depends only on the previous state. With respect to NLP, the hidden states can be considered to represent parts of speech, such as nouns and verbs. The observed events of the system would be the words themselves. A HMM then uses the above probabilities to model the sequence of hidden states that most likely produced the observable text.

HMMs require training data to be able to predict the parts of speech in a sentence. This is also their drawback in that they require manually labeled training data to be able to estimate probabilities.

Support Vector Machines

The last “old school” NLP technique we will discuss is Support Vector Machines (SVM). They were one of the first machine learning algorithms used for classification tasks. SVMs were useful for tasks such as sentiment analysis and spam detection. Much like machine learning tasks today, they worked by finding a hyperplane that separates different classes in high-dimensional spaces (yes this is hard to wrap your mind around, I would suggest reading the links to learn more).

Machine learning is based on vectors, or arrays of values. SVMs are no exception. They work by encoding the text document as a numerical vector such as TF-IDF values or others. They then search to find an optimal plane / boundary that can completely separate multiple classes (for example, text that speaks positively about something versus negatively about it). Kernel functions were used in cases where the plane that separated the classes was non-linear. They would map the input space into another space that would allow the classes to be divided by a hyperplane. This early machine learning method was very useful for classification tasks.

While they were good for classification tasks, they were not good at modeling sequences or structures in texts. Outside of classification they did not really perform as well. As is often the case with machine learning, large scale datasets were computationally expensive to train.

Conclusion

This basically sums up some of the classification methods for NLP. You can see how things developed from statistical analysis to the beginnings of machine learning. These methods were good at some things, but failed to capture the complexity and context of languages. They heavily relied on preprocessing steps such as tokenization, common and stop word removals, manual labeling, and so on. These steps could be time consuming and still not capture meaning.

The move to more machine learning methods would finally enable better handling of ambiguity and context in various languages. Next time we will close with modern techniques at NLP and how they have created a revolution in processing human languages.

Some Background on Natural Language Processing – Part 1: English

Posted on October 30, 2024 by bigbubba

I thought I would switch topics and start to talk about things like Large Language Models and how they could be applied to things like Geographic Information System (GIS) data. To do this, I think first it would be good to talk about the basis for some of these tools, such as natural language processing (NLP).

NLP is the basis for tools such as LLMs and even the ability to extract information from GIS data such as people, places, and things. As such, I feel it is important to understand NLP as it provides a lot of value to GIS information as well as advanced processing of such information.

Some background about English

Before I dive into how NLP works, I first want to talk about the English language and how it relates to NLP. This will give some background about why NLP is hard and how far we have come over the years. So we will have a bit of a history lesson. I love history and I finally have a reason to blog about it. Next post we will dig into the background of how NLP used to work and how it works today.

What is the English Language?

First off, English is a horrible language. It is a West Germanic language that is a part of the Indo-European language family. It is a mashup of several different languages and continues to evolve almost yearly with new terms being added. It got its start in what is now the UK prior to the fifth century when it was inhabited Celtic-speaking peoples. Then came the Romans who tried to enforce Latin, but the natives said no for the most part.

Old English

Next came the Anglo-Saxons who came to Britain from the area around the northwest of modern Germany. These Anglo-Saxon travelers brought their Ingvaeonic languages to Britain, displaced the Celts, and then started what we now call Old English. Then while Old English was forming, we throw in some influence form the Vikings who invaded off and on and imparted some of their Norse words into the mix. Then, to top things off, we had four known variants of the language that developed in Britain.

An example comes from Beowulf what was written around 1000 AD and here is the first line of it:

HWÆT: WE GAR-DENA IN GEARDAGUM (So. The Spear-Danes in days gone by)

Middle English

Then came the Normans in 1066. The Normans at the time spoke Anglo-Norman / Anglo-Norman French / Old French. This became the language of the upper class and the courts of the time. English then brought in some of the French words to add to the mix. However, this did simplify the grammar of the language a bit. This then went on to give us Middle English. An example of this is found in The Canterbury Tales, where one of my high school English teachers made us memorize the first verse of it and scarred me for life.

Whan that Aprill with his shoures soote (When April with its sweet-smelling showers)

Early Modern English

Then we move on to Early Modern English that started to take root in the 15th to 17th centuries, better known as the Renaissance. Here we decided to throw in some Latin, Greek, and even more French words for good measure. The good news is that we started to simplify the grammar even more. During this time, English went through the Great Vowel Shift. This change modified how long vowels were pronounced and thus threw in some spelling changes. The results of this shift have had wide-ranging impacts, including the basis for spelling and pronunciation mismatches in Modern English. Plus the English and French began their love/hate relationship so it is possible that some anti-French sentiment caused some words to be modified to lose their influence, so to speak. This was also the language used by Shakespeare who also added words to the language. Here is an example of what the Lord’s Prayer would have looked like during this period.

Our father which art in heauen, hallowed be thy name. Thy kingdome come. Thy will be done, in earth, as it is in heauen.

Modern and Contemporary English

We finally arrive at Modern English which started in the 17th century and continues on today. Thanks to the British Empire, English spread globally. This global contact ended up adding more word to the language. It went on to become the language of business, science, and diplomacy.

Then you may have heard of a bit of a spat between the British and a bunch of upstart colonies that formed the United States. As time went on, American English began to differ a bit from British English, mainly in some spellings and different words for things (a car’s hood vs a car’s bonnet). As the empire fell, we also got various dialects of English including Canadian and Australian English.

A Language that Confuses Native Speakers

This brings us to Contemporary English, a language with a large history of change and influences and words from various cultures. Unlike Latin-based languages, it really does not follow many rules. It tends to reuse words with different tenses and even different parts of speech. The same word can have a lot of different and even non-related meanings. Consider this, a perfectly valid sentence today:

I banked(1) on the advice from my friend and ended up making a lot of bank(2) by playing the bank(3) of slot machines at the casino. To keep it safe, I decided to take it to the new bank(4) that they built on the south bank(5) of the river.

As a native American English speaker, this makes me want to cry. This is why it is so hard for non-native English speakers to learn the language. Let us look at the abomination that I just wrote. The word bank is used as a:

Verb: meaning to rely on
Noun (slang really): meaning money
Noun: meaning a row of similar items
Noun: meaning a place to store things for later use
Noun: sloping raised land near a river

Imagine you are someone trying to learn English and you come across that sentence. Your head would explode. Think about how hard it would be to figure out what was actually being said. Note that I could have added another verb use by saying something like “On the way there, I had to bank to the left in my car to get in the proper lane.” Thankfully I felt bad enough making the above sentence in the first place.

Now imagine a computer trying to parse this. Remember that a computer is a glorified calculator. Admittedly that sentence was made up to be a specific edge case, but it illustrates the point that if a human would have trouble understanding something, you can be sure that I computer will too. And NLP has come a long way and these days would actually be able to make use of that sentence.

Distinguishing the meaning of the same word used in many different contexts has been, and continues to be, one of the primary reasons that getting NLP right has been so hard. We have ended up with many historic and structural complexities that have led to the language today.

Historically, most NLP research has been done on the English language due to its dominance in commerce and the link. NLP into other languages is working to catch up, but for these blog posts I am going to consider that to be out of scope since, well, I do not speak those languages.

And to add in another complication, research and methods of NLP for the English languages do not exactly work well for other languages. Each language has its own grammar and usage. Some languages such as Spanish and French (Romance Languages) are similar at least. But it is basically an apples to oranges comparison to try to apply English NLP principals to something like an Asian language.

Conclusion

I think I will leave things here for today. Next time I will go into the history of how NLP worked in the past before moving on to how it works today.

Stupid LIDAR Tricks Finale

Posted on October 10, 2024 by bigbubba

I thought I’d finally wrap this up so I can move on to other things. Since I’ve last posted, I replaced my Jankinator 1000 (nVidia Tesla P40 with a water cooler) and my nVidia RTX 2060 with an Intel Arc A770. It has 16Gb of VRAM and is actually a pretty fast GPU on my Linux box.

So far I’ve had pretty good luck getting things like TensorFlow and PyTorch working on it, as mentioned in a previous blog post. The only thing so far that I haven’t gotten to work 100% is Facebook’s Segment Anything Model 2 (SAM2) (which of course is what I mentioned in the last post I wanted to try to use with the LIDAR GeoTIFF). Basically now down to running out of VRAM although I’m not sure exactly why since I’ve tweaked settings for OpenCL memory allocation on Linux, etc. I’ve finally given up on that one for now and decided to just use the CPU for processing.

As a refresher, here is the LIDAR GeoTIFF I have been using.

SAM2 has the ability to automatically generate masks on input images. This was of interest to me since I wanted to test it to try to automatically identify areas of interest from LIDAR. Fortunately, the GitHub repo for SAM2 has a Jupyter notebook that made it easy to run some experiments.

With all of the default parameters, we can see that it only identified the lower right corner of the image. I tweaked a few of the settings and came up with this version:

The second image does show some areas highlighted. It got a grouping of row houses on the bottom of the image. It also found a few single family houses as well. However, it also flagged areas where there is really nothing of interest.

Finale

What have we learned from all of this? Well, LIDAR is hard. Automatically finding features of interest in LIDAR is also hard. We can get some decent results using image processing and/or deep learning techniques, but as with anything in the field, we are no where near 100%.

Previously I posted about training a custom RCNN to identify features of interest from GeoTIFF LIDAR. It was somewhat successful, although, as I said, it needed a lot more training data than I had available.

I do think that over time, people will develop models that do a good job of finding certain areas of interest in LIDAR. Some fields already have software to find specific features in LIDAR, especially in fields such as archaeology. And this is probably how things will continue for a while. We probably will not have a generalized “find everything of interest in this LIDAR image” model or software for a long time. However, it is possible to train a model to identify specific areas.

I am also working with LIDAR point data as well versus GeoTIFF versions. Point data is a lot different beast in that it has various classifications of the points after it has been processed. You can then do things such as extract tree canopy points or bare ground points. Conversion to a raster necessarily looses information as things have to be interpolated and reduced in order to produce a raster. I’ll post some things here in the future as while point cloud data can be harder to work with, I think the results are better than what can be obtained via rasters.

Saying Goodbye to my Census Tiger Data

Posted on October 6, 2024 by bigbubba

For a long time now I’ve maintained a version of the Public Domain Census Tiger Data converted from county-level to state-level. Over the years I’ve actually had a lot of those shape files downloaded so I’m glad they were useful to some people!

However, the Census is now putting out geopackages of there data at both the national and state levels. I’d also like to thank the US Census for doing that as I think state-level data is way more usable than county level!

As a result, I think there’s not really a reason for me to host the state-level data any more. I’ll still host the script to download and create a PostgreSQL/PostGIS database at my github repo, and might even get around to adding scripts to automatically process the geopackages.

So, so long my state-level Census data and thanks for all the fish!

How I Got TensorFlow and PyTorch working on an Intel Arc A770 GPU

Posted on September 8, 2024 by bigbubba

7 Nov 2024 Edit: Updated the command to install Pytorch.

22 Dec 2024 Edit: Updated and simplified the software install as there are aliases now.

Note: It’s much easier if you upgrade your environment to just nuke the conda install and recreate it if/when new versions of the intel software come out.

Recently I replaced my Jankinator 1000 with an Intel Arc A770 16GB card. While this card is a 16GB card versus 24GB, it’s a lot faster and, well, it is not an NVIDIA card. Plus, I can do things like mixed precision processing and modify batch sizes to cope with the loss of 8GB. I will spare you my thoughts on certain companies and their monopolies in Deep Learning systems.

I thought I would write up this post since, well, some Intel documentation is a bit scattered and is not the easiest to follow. Plus some of it will end up messing with your system if you are not careful. For reference, I’m running Linux Mint 22, which is based on Ubuntu 24.04 Noble.

So for the standard disclaimer, these instructions got everything working for me. I make absolutely no guarantees that they will work for you. If your house burns down or a portal opens and Cthulhu appears, don’t blame me.

Drivers

First off, make sure you are running kernel version 6.8.0-41 or later. In some earlier kernels, someone posted a patch that caused a regression in the compute engines on the Arc GPUs (https://github.com/intel/compute-runtime/issues/726). This has been fixed in more recent kernels on Ubuntu. If you are on Linux Mint like I am (and would highly recommend), you should have the latest HWE kernel. If not, go ahead and install it first.

Next we will follow some of these instructions comes from Intel’s documentation at https://github.com/intel/intel-extension-for-tensorflow/blob/main/docs/install/experimental/install_for_arc_gpu.md. Note that it does not currently mention Ubuntu 24.04, but trust me, it works. We will mostly follow their documentation to properly install the drivers, just with a few changes.

Core Drivers

First set up the gpg key for the Intel repo.

sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg

Now we install the Intel GPU repository. We differ from their instructions here because while the Nobel repository is not mentioned, trust me it is there.

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu noble unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-noble.list
sudo apt-get update

Next you will need to install the proper packages.

apt install intel-opencl-icd libze1 intel-level-zero-gpu-raytracing intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 libegl-mesa0 libegl1 libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

This is another difference from the Intel documentation. libze1 has replaced intel-level-zero-gpu, although intel-level-zero-gpu-raytracing is still around. Also libegl1-mesa seems to have been renamed to libegl1 except for the dev package.

You should probably reboot now since the intel-media-va-driver-non-free driver contains some extra functionality that the fully open source version does not.

One API

Now we go ahead and follow their instructions for setting up the Intel One API repository.

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor --output /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt-get update

Here we also stray a bit from their documentation. I have found we need to install a LOT of packages from One API to make sure everything works, including their TensorFlow and PyTorch extensions.

sudo apt install intel-basekit

Yes that is a lot of packages, but you will need them eventually.

Now add these statements in your .bashrc to ensure that all the necessary environment variables are set:

# Intel stuff
source /opt/intel/oneapi/setvars.sh
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/mkl/latest/env/vars.sh

Save your .bashrc file and now whenever you open a terminal you should see something like this:

:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.2.21(1)-release
args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

If you want, you can run clinfo to make sure OpenCL is working on the Arc.

bmaddox@sdf1:~$ clinfo
Number of platforms 2
Platform Name Intel(R) OpenCL
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 3.0 LINUX
Platform Profile FULL_PROFILE
......
Platform Name Intel(R) OpenCL Graphics
Number of devices 1
Device Name Intel(R) Arc(TM) A770 Graphics
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 3.0 NEO

I’ve abbreviated a lot of output, but you should see the Arc listed as a platform after running clinfo.

Congratulations! The hardest part is now over. Now it is time to get TensorFlow and PyTorch working with the Intel Arc GPU.

TensorFlow

Now we need to install Anaconda/Miniconda. This is because the most recent version of Python that the Intel TensorFlow and PyTorch extensions support is Python 3.11. You can find instructions on how to install conda from their websites.

Once you have it created, we will first work on the Intel TensorFlow extension.

conda create -n "tensorflowintel" python=3.11

or whatever you want to call your virtual conda environment. Activate that environment with:

conda activate tensorflowintel

Next we need to install the TensorFlow extension and TensorFlow itself.

pip install --upgrade intel-extension-for-tensorflow[xpu]

Make sure you specify the [xpu] at the end or else everything will end up using the CPU.

Now we verify that the Intel TensorFlow extension works. Run python to get into an interpreter and then type in the following:

(tensorflowintel) bmaddox@sdf1:~$ python
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf

Now, after running the import statement, you will see a lot of output. Ignore anything that mentions cuda since of course we are not going to install cuda without an NVIDIA card. You will see something like this:

2024-09-08 12:01:20.128862: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-08 12:01:20.401696: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-08 12:01:20.401756: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-08 12:01:20.451490: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-08 12:01:20.557915: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-09-08 12:01:20.559102: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-08 12:01:21.568560: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-09-08 12:01:23.797012: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /tensorflow/core/bfc_allocator_delay. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-09-08 12:01:23.801616: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /xla/service/gpu/compiled_programs_count. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-09-08 12:01:23.814748: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_executions. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-09-08 12:01:23.814784: W external/local_tsl/tsl/lib/monitoring/collection_registry.cc:81] Trying to register 2 metrics with the same name: /jax/pjrt/pjrt_executable_execution_time_usecs. The old value will be erased in order to register a new one. Please check if you link the metric more than once, or if the name is already used by other metrics.
2024-09-08 12:01:24.959105: I itex/core/wrapper/itex_gpu_wrapper.cc:38] Intel Extension for Tensorflow* GPU backend is loaded.
2024-09-08 12:01:24.959972: I external/local_xla/xla/pjrt/pjrt_api.cc:67] PJRT_Api is set for device type xpu
2024-09-08 12:01:24.960001: I external/local_xla/xla/pjrt/pjrt_api.cc:72] PJRT plugin for XPU has PJRT API version 0.33. The framework PJRT API version is 0.34.
2024-09-08 12:01:25.106392: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:134] Selected platform: Intel(R) Level-Zero
2024-09-08 12:01:25.106772: I external/intel_xla/xla/stream_executor/sycl/sycl_gpu_runtime.cc:159] number of sub-devices is zero, expose root device.
2024-09-08 12:01:25.107555: I external/xla/xla/service/service.cc:168] XLA service 0xac38370 initialized for platform SYCL (this does not guarantee that XLA will be used). Devices:
2024-09-08 12:01:25.107570: I external/xla/xla/service/service.cc:176] StreamExecutor device (0): Intel(R) Arc(TM) A770 Graphics, <undefined>
2024-09-08 12:01:25.109696: I itex/core/devices/gpu/itex_gpu_runtime.cc:130] Selected platform: Intel(R) Level-Zero
2024-09-08 12:01:25.110088: I itex/core/devices/gpu/itex_gpu_runtime.cc:155] number of sub-devices is zero, expose root device.
2024-09-08 12:01:25.110521: I external/intel_xla/xla/pjrt/se_xpu_pjrt_client.cc:97] Using BFC allocator.
2024-09-08 12:01:25.110541: I external/xla/xla/pjrt/gpu/gpu_helpers.cc:106] XLA backend allocating 14602718822 bytes on device 0 for BFCAllocator.
2024-09-08 12:01:25.112748: I external/local_xla/xla/pjrt/pjrt_c_api_client.cc:119] PjRtCApiClient created.

Pay attention to the last few lines. They should show that the Arc is detected and available. Next verify by running this:

>>> gpus = tf.config.list_physical_devices('XPU')
>>> for gpu in gpus:
...     print("Name:", gpu.name, " Type:", gpu.device_type)
...
Name: /physical_device:XPU:0 Type: XPU
>>>

If you run into any issues, you may have to import the Intel TensorFlow extension to make sure everything works (you will need it anyway if you are modifying existing sources)

>>> import intel_extension_for_tensorflow as itex
>>> print(itex.__version__)
2.15.0.2
>>>

Congratulations! You now have a virtual environment set up to work with TensorFlow. You can probably get this to work with existing source by making sure to downgrade the version of TensorFlow you use to the above and install the Intel extension. Then change references to GPU to XPU to make sure everything is using the Intel card.

PyTorch

Since we went through everything to get the drivers and TensorFlow working, we can now look at using the Intel PyTorch extension. Note, I have found that it’s better to keep TensorFlow and PyTorch in separate environments. That way you will be less likely to run into issues.

First off a couple of notes. I am purposely not posting links to these sites because you should not go there. Pain and sorrow will only come to you if you do. If you go to the PyTorch website, they will mention rebuilding PyTorch so that it supports Intel XPU devices. Do NOT do this. Intel also has a website out there that mentions adding another repository to install PyTorch and some additional drivers. Do NOT do this either. Doing so will likely break everything. Yes it is a little fragile at the moment, that is the whole reason I am writing this 🙂

We will again create a Python 3.11 environment using conda.

conda create -n "pytorchintel" python=3.11

Activate this environment with

conda activate pytorchintel

Now we install PyTorch into this environment:

python -m pip install torch==2.3.1 torchvision torchaudio==2.3.1 intel-extension-for-pytorch==2.3.110+xpu oneccl_bind_pt==2.3.100+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Again run the Python interpreter and run the following to verify everything is working in this environment:

(pytorchintel) bmaddox@sdf1:~$ python3
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import intel_extension_for_pytorch as ipex
>>> torch.xpu.is_available()
True

That is it! You are now done!

As with TensorFlow, you will need to make some code changes for PyTorch to work. Instead of sending the model to “GPU”, you will need to replace calls so they look like this:

model = model.to('xpu')
data = data.to('xpu')
model = ipex.optimize(model)

Conclusion

While it is a bit fragile now, I have had good luck with using my Arc A770 for deep learning and computer vision tasks. Things like Stable Diffusion using OpenVino work REALLY well. Other things work by removing and installing packages after you install their requirements and making some minor code modifications. Intel has some really good documentation available online to port code to use their XPU devices and I highly suggest reading them before you start trying to run existing TensorFlow and PyTorch applications.

Stupid LiDAR Tricks Part 2

Posted on June 1, 2024 by bigbubba

In this part of the series, I want to go over image salience and how it can be applied to finding “interesting” things in LiDAR. Image salience (usually used to make salience maps) refers to the ability to identify and highlight the most important or attention-grabbing regions in an image. It is meant to highlight areas of an image where the human eye would focus first. Salience maps are used to visualize these regions by assigning a salience value to each pixel, indicating its likelihood of being a point of interest. This technique is widely used in computer vision for tasks such as object detection, image segmentation, and visual search.

Background Research

Salience research actually has been going on for decades now. It began back in the 1950’s where it was a field of psychology and neuroscience that sought to understand how humans perceive and prioritize visual information. It mainly stayed in the neuroscience and psychology fields until roughly the end of the 1970’s.

In the 1980’s David Marr proposed a computational theory of vision that provided a framework for understanding the stages of how visual systems could process complex scenes. This can be considered the beginning of trying to recreate how humans prioritized “interesting” parts of an image. This also can be considered the base upon which later computer science work would be performed.

In the 1990’s the concept of salience maps was proposed to model how the human visual system identifies areas of interest by Itti, Koch, and Neibur. In 1998 they created one of the first computational models that combined features such as color, intensity, and orientation to calculate areas of interest. These algorithms added more complex features during the 2000’s.

With the rise of deep learning in the 2010’s, image salience took a turn and began to use CNNs for detection. By definition, a CNN learns hierarchical features from large datasets and can identify complex patterns in images. Combined with techniques such as adversarial learning, multi-scale analysis, and attention mechanisms, salience map generation is now more accurate than it has ever been.

CNN Salience Methods

Let us briefly examine how CNNs / deep learning are used in modern times for salience detection:

By definition, convolutional layers in a CNN extract local features from an input image. Early layers in the network capture low-level features such as edges and textures, while later levels in the network capture higher-level features such as actual objects. Multi-scale analysis to process features at different resolutions can also help with salience detection with a CNN.
Pooling layers reduce the spatial dimensions of the feature maps. This makes the computation more efficient and can even provide a form of spatial invariance so that features do not need to be the exact same scale.
The final fully-connected layer can then predict the salience map of the image based on the information gathered through the various layers.
An encoder-decoder architecture can be used as another extraction mechanism. Encoders extract features from the image using convolutional layers while gradually reducing the spatial dimensions of the image so that it can increase the depth of the feature maps.
Decoders can then reconstruct the salience map from the encoded features. In this case they may use techniques such as transposed convolutions to upscale the image or “unpooling” to restore the image to the original size.
Feature pyramid networks can process an image at multiple scales to gather coarse and fine details and then integrate the information into a final salience map.
Finally, generative adversarial networks can be used to produce salience maps by using a generator to create a map and a discriminator to evaluate the quality of the map. The generator learns to produce more accurate maps over time by attempting to “fool” the discriminator.

Salience Maps

So what is a salience map? A salience map is a representation that highlights the most important or attention-grabbing regions in an image. It assigns a salience value to each pixel, indicating its likelihood of being a region of interest. They are the end result of running a salience detector and can be used for:

Object detection by finding and localizing objects in an image.
Image segmentation by dividing the image into segments or objects based on their salience.
Visual search which can be used for things like scene understanding and image retrieval by identifying which areas should have more processing performed.
Attention prediction can be used to highlight areas where a person would be most likely to focus their attention.

Why Image Salience?

The last use is what this post is about: automatically finding areas in LiDAR that need to be inspected or to find anomalies in LiDAR. Imagine you are a large satellite company that collects thousands of images a day. It would be time consuming for a human to scan all over each image for something of interest. Salience maps are useful here in that they can help guide a human to places they need to examine. Potentially, this could be a huge time saver for things like image triage.

LiDAR in raster format provides some challenges, though, for image salience. For one, LiDAR represents dense, three-dimensional data instead of a normal two-dimensional image. It requires pre-processing, such as noise reduction and normalization. LiDAR can contain varying point densities and occlusions in the point cloud. This makes LiDAR harder to analyze as we are dealing with a “different” type of image than normal.

Conversion of point data to raster can also make things problematic for salience detection. LiDAR has several classes, one such class being bare earth. In most cases, rasterization processes will convert the points to heights based on ground level. However, in cases of buildings, this would typically have void areas because the laser cannot penetrate a building to find the ground level. Most tools will fill these voids with a flat ground-level elevation as many people do not wish to see empty areas in their data. This can make structures on bare earth rasters look similar to things like roads, thus an algorithm might have trouble differentiating the two.

Image Salience and LiDAR Workflow

Since I did not really cover this in the last post, here I will outline a workflow where salience and/or segmentation can be used to help with the processing of large LiDAR datasets (or really any type of raster dataset).

Once the point data has been converted to a raster, salience maps can be generated to identify and extract areas in the imagery that appear to contain meaningful features.
A human can either manually examine the identified areas, or some other complex object detection analysis algorithm can be run against the areas. This is where the time saving comes into play as only specific parts of the image are examined, not the entire image itself.
Features that are recognized can then be used for higher level tasks, ranging from identifying geographic features to detecting buildings.

Enough talk and history and theory, let us see how these algorithms actually work. This source can be found at on github under the salience directory. This time I made a few changes. I added a config.py to specify some values for the program to avoid having a lot of command line arguments. I also have copied the ObjectnessTrainedModel from OpenCV into the salience directory for convenience as not all packaging on Linux actually has the model included.

As a reminder, here are the input data sets from the last post (LiDAR and Hill Shade):

First off we will look at the algorithms in the venerable OpenCV package. OpenCV contains four algorithms for computing salience maps in an image:

Static Saliency Spectral Residual (SFT). This algorithm works by using the spectral residual of an image’s Fourier transform to generate maps. It converts the image to the frequency domain by applying the Fourier transform. It then computes the spectral residual by removing the logarithm of the frequency amplitude spectrum’s average from the logarithm of the amplitude spectrum. It then performs an inverse Fourier transform to convert the image back into the spatial domain to generate the initial salience map and applies Gaussian filtering to smooth out the maps.
Static Saliency Fine Grained (BMS). This algorithm uses Boolean maps to simulate how the brain processes an image. First it performs color quantization on the image to reduce the number of colors so that it can produce larger distinct regions. It then generates the Boolean maps by thresholding the quantized image at different levels. Finally, it generates the salience map by combining the various Boolean maps. Areas that are common across multiple maps are considered to be the salient area of the image.
Motion Salience (ByBinWang). This is a motion-based algorithm that is used to detect salient areas in a video. First it calculates the optical flow between consecutive frames to capture the motion information. It then calculates the magnitude of the motion vectors to find areas with significant movement. Finally, it generates a salience map by assuming the areas with higher motion magnitudes are the salient parts of the video.
BING Saliencey Detector (BING). This salience detector focuses on predicting the “objectness of image windows, essentially estimating how likely it is that a given window contains an object of interest. It works by learning objectness from a large set of training images using a simple yet effective feature called “Binarized Normed Gradients” (BING).

For our purposes, we will omit the Motion Salience (ByBinWang) method. It is geared towards videos or image sequences as it calculates motion vectors.

As this post is already getting long, we will also only look at the OpenCV image processing based methods here. The next post will take a look at using some of the more modern methods that use deep learning.

Static Saliency Methods

The static salience methods (SFT and BMS) do not produce output bounding boxes around features of an image. Instead, they produce a floating point image that highlights the important areas of an image. If you use these, you would normally do something like threshold the images into a binary map so you could find contours, then generate bounding boxes, and so on.

First up is the SFT method. We will run it now on the LiDAR GeoTIFF.

As you can see, when compared to the above original, SFT considers a good part of the image to be unimportant. There are some areas highlighted, but they do not seem to match up with the features we would be interested in examining. Next let us try the hill shade TIFF.

For the hill shade, SFT is a bit all over the place. It picks up a lot of areas that it thinks should be interesting, but again they do not really match up to the places we would be interested in (house outlines, waterways, etc).

Next we try out the BMS method on the LiDAR GeoTIFF.

You can see that BMS actually did a decent job with the LiDAR image. Several of the building footprints have edges that are lighter colored and would show up when thresholded / contoured. The streams are also highlighted in the image. The roadway and edges at the lower right side of the image are even picked up a bit.

And now BMS run against the hill shade.

The BMS run against the hill shade TIFF is comparable to the run against the LiDAR GeoTIFF. Edges of the things we would normally be interested in are highlighted in the image. It does produce smaller highlighted areas on the hill shade versus the original LiDAR.

The obvious downside to these two techniques is that further processing has to be run to produce actual regions of interest. You would have to threshold the image into a binary image so you could generate contours. Then you could convert those contours into bounding boxes via other methods.

Object Saliency Method

BING is an actual object detector that uses a trained model to find objects in an image. While not as advanced as many of the modern methods, it does come from 2014 and can be considered the more advanced detection method available for images in OpenCV. In the config.py file, you can see that with BING, you also have to specify the path to the model that it uses for detection.

Here we see that BING found larger areas of interest than the static salience methods (SFT and BMS). While the static methods, especially BMS, did a decent job at detecting individual objects, BING generates larger areas that should be examined. Finally, let us run BING against the hill shade image.

Again we see that BING detected larger areas than the static methods. The areas are in fact close to what BING found against the LiDAR GeoTIFF.

Results

What can we conclude from all of this? First off, as usual, LiDAR is hard. Image processing methods to determine image salience can struggle with LiDAR as many areas of interest are not clearly delineated against the background like they would be in an image of your favorite pet. LiDAR converted to imagery can be chaotic and really pushes traditional image processing methods to the extremes.

Of all of the OpenCV methods to determine salience, I would argue that BMS is the most interesting and does a good job even on the original LiDAR vs the hill shade TIFF. If we go ahead and threshold the BMS LiDAR image, we can see that it does a good job of guiding us to areas we would find interesting in the LiDAR data.

The BING objectness model fares the worst against the test image. The areas it identifies are large parts of the image. If it were a bigger piece of data, it would basically say the entire image is of interest and not do a great job helping to narrow down where exactly a human would need to look. And in a way this is to be expected. Finding objects in LiDAR imagery is a difficult task considering how different the imagery is versus normal photographs that most models are trained on. LiDAR does not often provide an easy separation of foreground versus background. High-resolution data makes this even worse as things like a river bank can have many different elevation levels.

Next time we will look at modern deep learning-based methods. How will they fare? Will they be similar to the BING objectness model and just tell us to examine large swaths of the image? Or will they work similarly to BMS and guide us to more individual areas. We will find out next time.

Stupid LiDAR Tricks Part 1 (Segmentation)

Posted on May 12, 2024 by bigbubba

My last few posts have been about applying machine learning to try to extract geographic objects in LiDAR. I think now I would like to go in another direction and talk about ways to help us find anything in LiDAR. There is a lot of information in LiDAR, and sometimes it would be nice to have a computer help us to find areas we need to examine.

In this case I’m not necessarily just talking about machine learning. Instead, I am discussing algorithms that can examine an image and identify areas that have something “interesting” in them. Basically, trying to perform object detection without necessarily determining the object’s identity.

For the next few posts, I think I’ll talk about:

Selective Search from OpenCV
Image Saliency Detection
Segment Everything from Facebook
Others as I have time to talk about them.

I have a GitHub repository where I’ll stick code that I’m using for this series.

Selective Search (OpenCV)

This first post will talk about selective search, in this specific case, selective search from OpenCV. Selective search is a segmentation technique used to identify potential regions in an image that might contain objects. In the context of object detection, it can help to quickly narrow down areas of interest before running more complex algorithms. It performs:

Segmentation of the Image: The first step in selective search is to segment the image into multiple small segments or regions. This is typically done using a graph-based segmentation method. The idea is to group pixels together that have similar attributes such as color, texture, size, and shape.
Hierarchical Grouping: After the initial segmentation, selective search employs a hierarchical grouping strategy to merge these small regions into larger ones. It uses a variety of measures to decide which regions to merge, such as color similarity, texture similarity, size similarity, and shape compatibility between the regions. This process is repeated iteratively, resulting in a hierarchical grouping of regions from small to large.
Generating Region Proposals: From this hierarchy of regions, selective search generates region proposals. These proposals are essentially bounding boxes of areas that might contain objects.
Selecting Between Speed and Quality: Selective search allows for configuration between different modes that trade off between speed and the quality (or thoroughness) of the region proposals. “Fast” mode, for example, might be useful in cases of real-time segmentation in videos. “Quality” is used when processing speed is less important than accuracy.

Additionally. OpenCV allows you to apply various “strategies” to modify the region merging and proposal process. These strategies are:

Color Strategy: This strategy uses the similarity in color to merge regions. The color similarity is typically measured using histograms of the regions. Regions with similar colors are more likely to be merged under this strategy. This is useful in images where color is a strong indicator of distinct objects.
Texture Strategy: Texture strategy focuses on the texture of the regions. Textures are usually analyzed using local binary patterns or gradient orientations, and regions with similar texture patterns are merged. This strategy is particularly useful in images where texture provides significant information about the objects, such as in natural scenes.
Size Strategy: The size strategy prioritizes merging smaller regions into bigger ones. The idea is to prevent over-segmentation by reducing the number of very small, likely insignificant regions. This strategy tries to control the sizes of the region proposals, balancing between small regions with no areas of interest to large areas that contain multiple areas of interest.
Fill Strategy: This strategy considers how well a region fits within its bounding box. It merges regions that together can better fill a bounding box, minimizing the amount of empty space. The fill strategy is effective in creating more coherent region proposals, especially for objects that are close to being rectangular or square.

Selective Search in Action

Now let us take a look at how selective search works. This image is of a local celebrity called Gary the Goose. To follow along, see the selective_search.py code under the selective_search directory in the above GitHub repository.

Now let us see how selective search worked on this image:

Selective search on image with all strategies applied.

For this run, selective search was set to quality mode and had all of the strategies applied to it. As you can see, it found some areas of interest. It got some of the geese, a street sign, and part of a truck. But it did not get everything, including the star of the picture. Now let us try it again, but without applying any of the strategies (comment out line 95).

Default selective search with no strategies applied.

Here we see it did about the same. Got closer to the large white goose, but still seems to not have picked up a lot in the image.

Selective Search on LiDAR

Now let us try it on a small LiDAR segment. Here is the sample of a townhome neighborhood.

And here is the best result I could get after running selective search:

As you can see, it did “ok”. It identified a few areas, but did not pick up on the houses or the small creeks that run through the neighborhood.

Selective Search on a Hill Shade

Can we do better? Let us first save the same area as a hillshade GeoTIFF. Here we take the raw image and apply rendering techniques that simulate how light and shadows would interact with the three dimensional surface, making topographic features in the image easier to see. You can click some of the links to learn more about it. Here is the same area where I used QGIS to create and export a hill shade image.

You can see that the hill shade version makes it easier for a human to pick out features versus the original. It is easier to spot creeks and the flat areas where buildings are. Now let us see how selective search handles this file.

Selective Search run against a hill shade.

It did somewhat better. It identified several of the areas where houses are located, but it still missed all of the others. It also did not pick up on the creeks that run through the area.

Why Did It Not Work So Well?

Now the question you might have is “Why did selective search do so badly in all of the images?” Well, this type of segmentation is not actually what we would define as object detection today. It’s more an image processing operation that builds on techniques that have been around for decades that make use of pixel features to identify areas.

Early segmentation methods that led to selective search typically did the following:

Thresholding: Thresholding segments images based on pixel intensity values. This could be a global threshold applied across the entire image or adaptive thresholds that vary over different sized image regions.
Edge Detection: Edge detectors work by identifying boundaries of objects based on discontinuities in pixel intensities, which often correspond to edges. Some include a pass to try to connect edges to better identify objects.
Region Growing: This method starts with seed points and “grows” regions by appending neighboring pixels that have similar properties, such as color or texture.
Watershed Algorithm: The watershed algorithm treats the image’s intensity values as a topographic surface, where light areas are high and dark areas are low. “Flooding” the surface from the lowest points segments the image into regions separated by watershed lines.

Selective search came about as a hybrid approach that combined computer vision-based segmentation with strategies to group things together. Some of these were similarity measures such as color, texture, size, and fill to merge regions together iteratively. It then introduced a hierarchical grouping that built segments at multiple scales to try to better capture objects in an image.

These techniques do still have their uses. For example, they can quickly find objects on things like conveyor belts in a manufacturing setting, where the object stands out against a uniform background. However, they tend to fail when an image is “complicated”, like LiDAR as an example or a white goose that does not easily stand out against the background. And honestly, they are not really made to work with complex images, especially with LiDAR. These use cases require something more complex than traditional segmentation.

This is way longer now than I expected, so I think I will wrap this up here. Next time I will talk about another computer vision technique to identify areas of an interest in an image, specifically, image saliency.

Applying Deep Learning to LiDAR Part 4: Detection

Posted on April 24, 2024 by bigbubba

All of the previous parts of this series have talked about the challenges in training a CNN to detect geological features in LiDAR. This time I will talk about actually running the CNN against the test area and my thoughts on how it went.

Detection

I was surprised at how small the actual network was. The xCeption model that I used ended up only being around 84 megabytes. Admittedly this was only three classes and not a lot of samples, but I had expected it to be larger.

Next, the test image was a 32-bit single band LiDAR GeoTIFF that was around 350 gigabytes. This might not sound like much, but when you are scanning it for features, believe me it is quite large.

First off, due to the size of the image, and that I had to use a sliding window scan, I knew that the processing time would be long to run detections. I did some quick tests on subsections and realized that I would have to break up the image and run detection in chunks. This was before I had put a water cooler on my Tesla P40, and since I wanted to sleep at night, just letting it run to completion was out of the question. Sleep was not the only concern I had. I live south of the capital of the world’s last superpower, yet at the time we lost power any time it got windy or rained. The small chunks meant that if I lost power, I would not lose everything and could just restart it on the interrupted part.

I decided to break the image up into an 8×8 grid. This provided a size where each tile could be processed in two to three hours. I also had to generate strips that covered the edges of the tiles to try to capture features that might span two tiles. I had no idea how small spatially a feature could be, so I picked a 200×200 minimum pixel size for the sliding window algorithm. This still meant that each tile would have several thousand potential areas to run detections against. In the end it took several weeks worth of processing to finish up the entire dataset (keeping in mind that I did not run things at night since it would be hard to sleep with a jet engine next to you).

How well did it work? Well, that’s the interesting part. I’m not a geomorphologist so I had to rely on the client to examine it. But here’s an example of how it looks via QGIS:

As you can see, it tends to see a lot of areas as floodplain alluvium. After consulting with the subject matter experts, there are a few things that stand out.

The larger areas are not as useful as the smaller ones. As I had no idea of a useful scale, I did not have any limitations to the size of the bounding areas to check. However, it appears the smaller boxes actually do follow alluvium patterns. The output detections need to be filtered to only keep the smaller areas.
It might be possible to run a clustering algorithm against the smaller areas to better come up with larger areas that are correctly in the class.

Closing thoughts and future work

While mostly successful, as I have had time to look back, I think there are different or better ways to approach this problem.

The first is to train on the actual LiDAR points versus a rasterization of them. Instead of going all the way to rasterization, I think keeping the points that represent ground level as inputs to training might be a better way to go. This way I could alleviate the issues with computer vision libraries and potentially have a simpler workflow. I am curious if geographic features might be easier for a neural network to detect if given the raw points versus a converted raster layer.

If I stay with a rasterized version, I think if I did it again I would try one of the YOLO-class models. These models are state-of-the-art and I think may work better in scanning large areas for smaller scale features as it does its own segmentation and detection. The only downside to this is I am not entirely sure YOLO’s segmentation would identify areas better than selective search due to the type of input data.

I think it would also be useful to revisit some of the computer vision algorithms. I believe selective search could be extended to work with higher numbers of bits per sample. Some of the other related algorithms could likely be extended. This would help in general with remotely sensed data as it usually contains higher numbers of bits per sample.

While there are a lot of segmentation models out there, I am curious how well any of them would work with this type of data. Many of them have the same limitations as OpenCV does and cannot handle 32-bits per sample imagery. These algorithms typically images where objects “stand out” against the background. LiDAR in this case is much different than the types of sample data that such images were trained on. For example, here is a sample of OpenCV’s selective search run against a small section of the test data. The code of course has to convert the data to 8-bits/sample and convert it to a RGB image before running. Note that this was around 300 meg in size and took over an hour to run on my 16 core Ryzen CPU.

You can see that selective search seems to have trouble with this type of LiDAR as there are not anything such as house lots that could be detected. The detections are a bit all over the place.

Well that’s it for now. I think my next post will be about another thing I’ve been messing with: applying image saliency algorithms to LiDAR just to see if they’d pull anything out.

Applying Deep Learning to LiDAR Part 3: Algorithms

Posted on March 31, 2024 by bigbubba

Last time I talked about the problems finding data and in training a machine learning model to classify geologic features from LiDAR. This time I want to talk about how various libraries can (and cannot) handle 32-bit imagery. This actually caused most of the technical issues with the project and required multiple work-arounds.

OpenCV and RasterIO

OpenCV is probably the most widely used computer vision library around. It’s a great library, but it’s written to assume that the entire image can be loaded into memory at once. To get around this, I had to use the rasterio library as it will read on demand and let you easily read in parts of the image at a time. To use it with something like Tensorflow, you have to change the data with some code like this:

with rasterio.open(in_file) as src:
    # Read the data as a 3D array (bands, rows, columns)

    # Convert the data type to float32
    data = data.astype(numpy.float32)

    # Transpose the array to match the shape of cv2.imread (rows, columns, bands)
    data = numpy.transpose(data, (1, 2, 0))

    return data

Many computer vision algorithms are designed to expect certain types of images, either 8 to 16-bit grayscale or up to 32-bit three channel (such as RGB) images. OpenCV, one of the most popular, is no different in this aspect . The mathematical formulas behind these algorithms have certain expectations as well. Sometimes they can scale to larger numbers of bits, sometimes not.

Finding Areas of Interest

This actually impacts how we search the image for areas of interest. There are typically two ways to search an image using computer vision: sliding window and selective search. A sliding window search is a technique used to detect objects or features within an image by moving a window of a fixed size across the image in a systematic manner. Imagine looking through a small square or rectangular frame that you slide over an image, both horizontally and vertically, inspecting every part of the image through this frame. At each position, the content within this window is analyzed to determine whether it contains the object or feature of interest.

Selective Search is an algorithm used in computer vision for efficient object detection. It serves as a preprocessing step that proposes regions in an image that are likely to contain objects. Instead of evaluating every possible location and scale directly through a sliding window, Selective Search intelligently generates a set of region proposals by grouping pixels based on similarity criteria such as color, texture, size, and shape compatibility.

Selective search is more efficient than a sliding window since it returns only “interesting” areas of interest versus a huge number of proposals that a sliding window approach uses. Selective search in OpenCV is only designed to work with 24 bit images (ie, RGB images with 8 bits per channel). To use higher-bit data with it, you would have to scale it to 8 bits/channel. A 32-bit dataset (which includes negative values as these typically indicate no-data areas) can represent 2.15 billion distinct values. To scale to 8 bits per channel, we would also need to convert it from floating point to 8-bit integer values. In this case, we can only represent 256 discrete values. As you can see, this is quite a difference in how many elevations we can differentiate.

Here’s an example of the areas of interest that a sliding window and image pyramid generates. As you can see, there are a lot of regions of interest that are regularly placed across the image.

However, selective search is not always perfect. Below is an example where I ran OpenCV 4’s selective search against an image of mine. It generated 9,020 proposed areas to search. I zoomed in to show it did not even show the hawk as a region of interest.

Selective search output run against an image with a hawk.

Here’s a clipped version of the input dataset when viewed in QGIS as a 32-bit DEM. Notice in this case the values range from roughly 1,431 to 1,865.

QGIS with a clip of the original dataset.

Now here is a version converted to the 8-bit byte format in QGIS.

As you can see, there is quite a difference between the two files. And before you ask, int8 just results in a black image no matter how I try to adjust the no-data value.

Tensorflow tf.data Pipeline

So to run this, I set up a Tensorflow tf.data pipeline for processing. My goal was to be able to turn any of the built-in Tensorflow models into a RCNN. An interesting artifact of using built-in models, Tensorflow, and OpenCV was that the input data actually had to be converted into RGB format. Yes, this means a 32-bit grayscale image had to become a 32-bit RGB image, which of course greatly increased the memory requirements. Here’s a code snippet that shows how to use Rasterio, PIL, and numpy to take an input image and convert it so it’s compatible with the built-in Tensorflow models:

def load_and_preprocess_32bit_image(image_bytes: tensorflow.string) -> numpy.ndarray:
    """Helper function to preprocess 32-bit TIFF image
    Args:
       image_bytes (tensorflow.string): Input image bytes
    Returns:
        numpy.ndarray: decoded image
    """

    with rasterio.io.MemoryFile(image_bytes) as memfile:
        with memfile.open() as dataset:
            image = dataset.read()
    
    image = Image.fromarray(image.squeeze().astype('uint32')).convert('RGB')
    image = numpy.array(image)  # Convert to NumPy array
    image = tensorflow.image.resize(image, local_config.IMAGE_SIZE)

    return image

This function takes the 32-bit DEM, loads it, converts it to a 32-bit RGB image, and then converts it to a format that Tensorflow can work with.

You can then create a function that can use this as part of a tf.data pipeline by defining a function such as this:


def load_and_preprocess_image_train(image_path, label, in_preprocess_input,
                                    is_32bit=False):
    """ Define a function to load, preprocess, and augment the images
    Args:
        image_path (_type_): Path to the input image
        label (_type_): label of the image
        in_reprocess_input: Function from keras to call to preprocess the input
        is_32bit (bool, optional): Is the image a 32 bit greyscale. Defaults to 
                                   False.

    Returns:
     _type_: Pre-processed image and label
    """

    image = tensorflow.io.read_file(image_path)

    if is_32bit:
        image = tensorflow.numpy_function(load_and_preprocess_32bit_image, 
                                          [image],
                                          tensorflow.float32)
    else:
        image = tensorflow.image.decode_image(image, 
                                              channels=3,
                                              expand_animations=False)
        image = tensorflow.image.resize(image, local_config.IMAGE_SIZE)
     
    image = augment_image_train(image)  # Apply data augmentation for training
    image = in_preprocess_input(image)

    return image, label

Lastly, this can then be set up as a part of your tf.data pipeline by using code like this:

# Create a tf.data.Dataset for training data
train_dataset = tf.data.Dataset.from_tensor_slices((train_image_paths, train_labels))
train_dataset = 
    train_dataset.map(lambda path, label:
        image_utilities.load_and_preprocess_image_train(path,
                                                        label,
                                                        preprocess_input,
                                             is_32bit=local_config.USE_TIF,
                                             num_parallel_calls=tf.data.AUTOTUNE)

(Yeah trying to format code on a page in WordPress doesn’t always work so well)

Note I plan on making all of the code public once I make sure the client is cool with that since I was already working on it before taking on their project. In the meantime, sorry for being a little bit vague.

Training a Model to be a RCNN

Once you have your pipeline set up, it is time to load the built-in model. In this case I used Xception from Tensorflow and used the pre-trained model to do transfer learning by the standard omit the top layer, freeze the previous layers, then add a new layer on top that learns from the input.

# Load the model without pre-trained weights
base_model = Xception(weights=local_config.PRETRAINED_MODEL, 
                      include_top=False, 
                      input_shape=local_config.IMAGE_SHAPE,
                      classes=num_classes, input_tensor=input_tensor)

# Freeze the base model layers if we're using a pretrained model

if local_config.PRETRAINED_MODEL is not None:
     for layer in base_model.layers:
         layer.trainable = False

# Add a global average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)

# Create the model
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)

In this case, I used Adam as the optimizer as it performed better than something like the stock SGD and I added in two model callbacks. The first saves the model to disk every time the validation accuracy goes up, and the second stops processing if the accuracy hasn’t improved over a preset number of epochs. These are actually built-in to Keras and can be set up as follows:

# construct the callback to save only the *best* model to disk based on 
# the validation loss
model_checkpoint = ModelCheckpoint(args["weights"], 
                                   monitor="val_accuracy", 
                                   mode="max", 
                                   save_best_only=True,
                                   verbose=1)

# Add in an early stopping checkpoint so we don't waste our time
early_stop_checkpoint = EarlyStopping(monitor="val_accuracy",
                                      patience=local_config.EPOCHS_EXIT,
                                      restore_best_weights=True)

You can then add them to a list with

model_callbacks = [model_checkpoint, early_stop_checkpoint]

And then pass that into the model.fit function.

After all of this, it was a matter of running the model. As you can imagine, training took several hours. Since this has gotten a bit long, I think I’ll go into how I did the detection stages next time.