Some Thoughts on Creating Machine Learning Data Sets

Posted on April 26, 2023 by bigbubba

My whole professional career (26 years now!) has been in slinging data around. Be it writing production systems for the government or making deep learning pipelines, I have made a living out of manipulating data. I have always taken pains to make sure what I do is correct and am proud of how obsessive I am about putting things together.

Lately I have been back to doing deep learning and computer vision work and have been looking at some of the open datasets that are out there for satellite imagery. Part of this work has involved trying to combine datasets to train models for object classification from various satellite imagery providers.

One thing I have noticed is that it is still hard to make a good deep learning dataset. We are only human, and we miss things sometimes. It is easy to misclassify things or accidentally include images that might not be a good candidate. Even ImageNet, one of the biggest computer vision and deep learning datasets out there, has been found to contain errors in the training data.

This got me thinking about putting together datasets and what “rules of thumb” I would use for doing so. As there do not seem to be as many articles out there about making datasets versus using them, I thought I would add my two cents on making a good machine learning dataset. This is by no means a criticism about the freely-available datasets out there now. In fact, we should all be thankful so many people are putting them out there, and I hope to add to these soon! So, in no particular order, here we go.

Limit the Number of Object Classes

One thing I have noticed is many datasets try to break out their object classes into too fine of detail. This looks to be especially true for satellite datasets. For example, one popular dataset breaks aircraft down into multiple classes ranging from propeller to trainer to jet aircraft. One problem with this approach is that it becomes easy to pick the wrong category while classifying them. Jet trainers are out there. Should that go into the jet category, or the trainer category? There are commercial propeller aircraft out there. What do we do with them?

It is one thing if you are purposely trying to differentiate different types of aircraft from satellite imagery. Even then, however, I would guess that a neural network would learn mostly the same features for each object class, and you would end up getting the aircraft category correct but have numerous misclassifications for the type. It will also be a lot easier to accidentally mislabel things from imagery while you are building the dataset. Now this is different if you are working with imagery to which only certain types of three letter agencies have access. But, most of us have to make due with the types of imagery that are freely available.

Verify Every Dataset Before Use

We all get into a hurry. We have deadlines, our projects are usually underfunded, and we spend large amounts of time putting out fires. It is still vitally important to not blindly use a dataset to train your model! Consider the below image. In one public domain dataset out there, around eight of these images are all labeled as boats.

This is a classical example of “we’re only human.” I would wager that a lot of datasets contain errors such as this. It is not that anyone is trying to purposely mislead you. It just happens sometimes. This is why it is essential to go through your dataset no matter how boring it might be. Labeling is a tough job, and there is probably a room in Hell that tortures people by making them label things all day and night. Again, everyone makes mistakes.

Clean up Your Datasets

Some datasets out there are derived from Google Earth. It is a source of high quality imagery and for now the terms seem to not require your firstborn or an oath of loyalty. The problem comes when you include things like the image below in your training set.

Here you can see that someone used an aircraft that had the Google Earth watermark superimposed on top of it. If you only have one or two, it likely will not be an issue. However, if you have a lot of things like this, then your network could learn features from the text and start to expect imagery to have this in it. This is an example of where you should practice due diligence in cleaning up your data before putting it into a machine learning dataset.

Avoid Clusters (Unless Looking for Them)

When extracting objects from Google Earth, you might come across something you consider a gold mine. Say you are extracting and labeling cars from imagery, and you come across a full parking lot. This seems like an easy way to suddenly add a lot of training data, and you might end up with a lot of images such as the one below.

In a dataset I recently worked with, there were several object classes that had a lot of images of the same types of objects right next to each other. If you do this, keep in mind that there could be some side effects from having a lot of clusters in your dataset.

It could (possibly) be helpful because your object class can present different views if they are present in a cluster. In the above, you can see the top of one bus and the back of another right beside it. This can aid in learning different features for that class.
A negative is that if you have a lot of clusters of objects in your training data, your classifier might lean towards detecting objects in groups instead of individual objects. If you are specifically looking for clusters then this is OK. If you want to search for individual ones, then it could hurt your accuracy.
The complexity of your training data could increase and lead to slow-downs during the training process by having more features present for each example object than would be present ordinarily.

In general, it is OK to have some clusters of objects in your data. Just be mindful that if you are looking for individual objects, you should try not to have a lot of clusters in your training data.

Avoid Duplicate Objects

This is true for objects in the same class or between object classes. If you have a lot of duplicates, you can run the risk of over-fitting during training. Problems can also arise if you, say, have a car in both a car object class and in a truck class. In this case you can end up with false detections because you accidentally matched the same features in multiple classes.

Pick Representative Training Data

If you are trying to train a model to detect objects from overhead imagery, you would not want to use a training set of pictures people have taken from their cellular phones. Similarly, if you want to predict someone’s mood from a high resolution camera, you would not want to train using fuzzy webcam images. This is stressed in numerous deep learning texts out there, but it is important to repeat. Check if the data you are using for training matches the data you intend to be running your model on. Fitness for use is important.

Invariance

The last thing I want to discuss here is what I feel is one of the most important aspects of deep learning: invariance. Convolutional neural networks are NOT invariant to scale and rotation. They can detect translations, but you do not get scale or rotational invariance by default. You have to provide this using data augmentation during your training phase.

Consider a dataset of numbers that you are using to train a model. This dataset will likely have the numbers written as you would expect them to be: straight up and down. If you train a model on this, it will detect numbers that match a vertical orientation, but accuracy will go down for anything that is off the vertical axis.

Frameworks like Keras make this easy by providing classes or functions that can randomly rotate, shear, or scale images during training before they are input into the network. This helps the classifier learn features in different orientations, something that is important for classifying overhead imagery.

Conclusion

In summary, these are just some general guidelines I use when I am putting together a dataset. I do not always get everything right, and neither will you. Get multiple eyes looking at your dataset and take your time. Labeling is a laborious, and if your attention drifts it is easy to get wrong. The better quality your training data, the better your model will perform.

Finally Upgraded!

Posted on April 26, 2023 by bigbubba

If you’ve been trying to come here over the past few days, you might have noticed that this blog has been up and down, changing themes, and what not. I have been having issues upgrading the PHP version on this website and finally got things ironed out thanks to my provider’s awesome support staff! So I promise it should be back to normal now. Mostly. Probably. 😉

Image Processing Basics Part 2

Posted on April 18, 2023 by bigbubba

Some Examples

Now that we have some of the basics down, let us look at some practical examples of the differences between how the brain sees things versus how a computer does.

The above photo of a part of the sky was taken by my iPhone 13 Pro Max using the native camera application. There were no filters or anything else applied to it. To our eyes, it looks fairly uniform: mainly blue with some lighter blue towards the right where the sun was the day I took the picture. Each pixel of the image represents the light that hit a sensor in the camera, was processed, and saved.

Our brain does not see a number of individual pixels. Instead, we see large splotches of colors. This is one of the shortcuts our brain does to ease the processing burden. If you look around a room, you do not see individual differences between the colors of the wall. Your wall mainly looks like a uniform color. We simply do not have the processing power to break down the inputs from our eyes into every minute part.

A computer, however, does have the ability to “see” an image in all of its different parts. Computers see everything as a number, be it the 1’s and 0’s of binary or color triplets in the RGB color space. If we look at the RGB color cube below, the computer sees all of the pixels in the above image as clustering somewhere around the lower right side of the cube. See the previous link for more information about the RGB color space.

RGB Color cube from wikipedia — RGB Color Cube (Wikimedia Commons contributors, “File:RGB color solid cube.png,” *Wikimedia Commons,* https://commons.wikimedia.org/w/index.php?title=File:RGB_color_solid_cube.png&oldid=656872808 (accessed April 18, 2023).

In a computer, the above image is loaded and each pixel is in memory in the form of triplets such as (135, 206, 235), which is the code for a color known as sky blue. The computer also does not have to take any shortcuts when it loads the image, meaning that the representation in memory is exactly the same as the image that was saved from the phone.

If we use the OpenCV library to calculate the histogram of the image and then count the number of colors, we in fact find that there are 2,522 unique colors in the picture of the sky. There is no magic here, we just do not have the same precision that a computer does when it comes to examining images or our environment. The big take away here is this: there is more information encoded in pictures or video than what our brains are capable of perceiving. Just because we cannot see certain details in a image does not mean that they are not there.

For another example, consider this image below. The edges look like nothing but black, and all you can really see is out of the window. It is definitely underexposed.

Photo out the window of my wife's grandparents' house. — Photo out the window of my wife’s grandparents’ house.

As mentioned above, a computer is able to detect more than our eyes can. Where we just see black around the edges, there is in fact detail there. We can adjust the exposure on the image to brighten it so that our eyes can see these details.

Above image with the exposure and contrast adjusted

With the exposure turned up (and adjusting the contrast as well), we can additionally see a picture of a bird, some dishes, and some cooking implements. This is not magic, nor is it adding anything to the image that was not already there. Image processing like this does not insert things into an image. It only enhances the details of an image so that they are more detectable to the human eye.

Many times, when image processing is in the news, people sometimes assume that it changing an image, or that it is inserting things that were not originally there. When you edit your images on your phone or tablet, you are manipulating the detail that is already in the image. You can enhance the contrast to make the image “pop.” You can change the color tone of the image to make it appear more warm or more cold to your liking. However, this is simply modifying the information that is already in the image to change how it appears to the human eye.

I am making a big deal about this point as future installments in this series will demonstrate how things actually work while hopefully dispelling certain myths that exist in pop culture. I think next time I will cover zooming in or out of an image (aka, resizing). Does it add something into the image or misrepresent it? We will find out.

Brian's Geek Blog

My Home on the Internet

Monthly Archives: April 2023