Some Background on Natural Language Processing – Part 2: Older Methods and Techniques

Last time we learned why English is a hard language, both for humans and especially for computers.  For this time I think we will look at the past of NLP to understand the present. It actually has been an interesting history to get to where we are now.

Historic NLP methods relied on approaches that focused more on linguistics and statistical analysis.  Computers were not as powerful as they are now, and definitely did not have GPUs with the crazy amount of processing capability that they have now.  They can be broadly grouped into six categories: rule-based systems, bag of words, term frequency-inverse document frequency, n-grams, Hidden Markov Models, and Support Vector Machines.  Let us explore these methods to see how they worked and led to today’s more powerful systems.  This post will have a lot of links to other sources of information as it is intended to be more of a gentle introduction than an exhaustive analysis.

Rule-Based Systems

Rule-Based Systems are perhaps the oldest method of trying to get a computer to be able to process a language.  These systems were built on predefined linguistic rules and handcrafted grammars.  Grammar in this case is not about usage, such as matching nouns and verbs and what not.  In this case the grammar is known as a formal grammar that is a mathematical structure of a sentence.  One specific method is a context-free grammar that has a set of rules of how to form sentences from smaller words and phrases.

Rule-based systems explicitly encode language in a series of rules so that text can be parsed.  This parsing would then be able to identify parts of speech and identify sentence structure.  These rules were typically created by linguists and software developers working together to try to codify language as a series of mathematical rules.

The rules they would create were simple so that they could be combined together for more complex sentences.   For example, one rule could be “any word that ends in ing is likely a verb”, while another could be “a noun follows determiner words such as a, then, and an.”  The developer would then take these rules to create a parser to take in sentences and break them down into their syntactic components using the mathematically-defined grammars.

Since I am a cat person, let us look at this very simple example sentence: “My cat is playing with a toy.”  In this case, the would be tagged as a determiner words, so cat would be tagged as a noun.  Is playing would be identified as a verb.  A is another determiner word, so toy would be tagged as another noun.

While this sounds simple, these systems were complex and labor intensive.  They required the work of expert linguists and developers to build the parsing systems.  They required a lot of maintenance as problems were found and the rules needed to be modified.  As they were based on formal mathematics, they were not very flexible when it comes to something sloppy like English.  This meant that rules after rules would have to be developed and stacked on top of one another.

Bag of Words

The Bag of Words (BoW) method was an early attempt at applying statistical processing to NLP.  It sought to represent text in a numerical form so that statistical algorithms could be applied.  These algorithms could then do things like classify documents or determine the similarity of two different documents.

The BoW method works by treating the input text as an unordered collection of words.  In this case, however, neither the order of words nor the sentence grammar are used.  The document is tokenized, where the words are converted into tokens.  A list of each unique token is then created.  The input text is analyzed and each word is counted to determine the frequency of occurrence in the document.  The document is then converted into a vector of word frequencies that is the same size as the number of unique words in the document.

You may know vectors from geometry where they represent lines and their direction.  It is still similar in this case where it is a mathematical array of numbers.  In the case of a frequency representation, the array could look something like [10, 5, 3, 9] and so on, where each number represents the frequency a word appears.

For computing document similarity, both documents would be turned into frequency vectors by enumerating their unique words and then creating a vector that is the same length as the total number of unique words.  The frequencies of each word are inserted into the array, with a zero placed in locations where a word occurs in one document but not in the other.  The order of the word frequencies of the array can be sorted so each array entry corresponds to the same one in the other document.

The comparison is easier to think of back in terms of geometry.  There are two methods to compute the distance between the vectors: cosine similarity and Euclidean Distance.  For cosine similarity, the cosine of the angle between the two vectors in multidimensional space is calculated with the result being between -1 and 1.  A value of 1 means the vectors are identical, 0 means they are orthogonal to each other (not similar), and a -1 means they are very different from each other, and other values are between one of the three.

Euclidean Distance calculates the straight-line distance between each point in the vectors in Euclidean space (told you geometry would come into it).  The distance metric is actually based on the Pythagorean Theorem that you may remember from high school.  The output of this is a vector that is the same length as the document vectors and each entry is the distance between the corresponding points.  The smaller the distances, the more similar the documents are.

While BoW is good for things like document similarity, it has several drawbacks when compared to other NLP techniques.  For one, since it just deals with statistical frequencies, it ignores things such as order of words and their context.  This means there is no real semantic information that is carried over.  The other drawback is that for large documents that are very different, the generated vectors can be largely empty and become cumbersome for distance computation algorithms.

Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) came about as an improvement on the BoW concept by weighing words based on importance in a document relative to a larger group of documents.  It attempts to identify significant words in a document while ignoring common words that frequently occur, such as the, and, but, and so on.

To understand how it works, let us break down the various parts of how it works.  The first part is the Term Frequency (TF).  As it sounds, this is a measure of how frequently a word appears in a document.  Thus words that occur more will have a higher score.  

 

 

The next part of the calculation is the Inverse Document Frequency (IDF).  This equation is a measure of how often a word occurs across the entire set of documents.  Words that do not appear often are assigned a higher score.

 

 

Now we multiply the two values together to come up with the TF-IDF value.

 

 

This value reflects how “important” a word is in a document relative to the total number of documents.  Word that occur often in one document but not in the total number of documents would end up with a high TF-IDF score.  This can be use to filter out the common words I mentioned above.

TF-IDF also has several drawbacks when it comes to NLP.  For one, it still ignores word order and the context in which words are used.  As previously mentioned, word order and context are very important when attempting to determine the actual meaning of a word and how it is used.

This leads into the next issue with TF-IDF.  Without context, it fails to account for words that mean the same thing.  Consider something like dog and puppy.  TF-IDF would treat them as different words instead of recognizing they are similar.

N-Grams

N-Grams were created to try to address the limitations of lack of word context.  They consider sequences of words to preserve word order versus looking solely at single words.  This consideration made N-Grams popular for modeling textual language and actual text generation.

An N-Gram is defined as a contiguous sequence of N items that are in text.  Consider the example sentence of “The cat sleeps on the chair”.  A unigram (1-Gram) would be a single word such as “The” and “cat”.  A bigram (2-Gram) is a sequence of two words, such as “The cat” and “cat sleeps”.  A trigram (3-Gram) is then a sequence of three words, such as “The cat sleeps” or “cat sleeps on”.  This continues on for larger numbers of N and counts the frequency that these sequences occur in a text.  As such, the higher the value of N, the more context is captured from the text.

You might immediately see a problem with N-Grams.  As the value of N increases, the number of possible word sequences goes up nearly exponentially.  For example, a vocabulary of V would have approximately VN N-Grams.  This makes the model much harder to train when there is a limited amount of text.  Higher values of N also can cause a smaller number of matches across documents.  N-Grams still fail to capture dependencies across sentences as they only capture localized context.

Hidden Markov Models

Hidden Markov Models (HMMs) were used quite a bit (and periodically used today) for NLP tasks like part-of-speech tagging and named entity recognition (NER).  They are probability models that represent the sequence of hidden states that exist in systems that are based on observable events.

A HMM consists of several parts:

  • Hidden States represent the phenomena of a system that are unobservable.  In relation to NLP, these hidden states could be parts of speech.
  • Observations are the words or their tokenization from text that are observed directly.
  • Transition probabilities are the probability of moving from one hidden state to another.  In NLP, nouns are usually followed by verbs, so that the probability of going from a noun to a verb is very high.
  • Emission probabilities are the probabilities of observing a particular word given a particular hidden state.  For example, if the hidden state is a verb, then the probability of the word sleep given the verb state would be high.

HMMs work by assuming that the system being modeled (not necessarily just for NLP work) is a Markov Process.  Put simply, a Markov Process means that the probability of each state depends only on the previous state.  With respect to NLP, the hidden states can be considered to represent parts of speech, such as nouns and verbs.  The observed events of the system would be the words themselves.  A HMM then uses the above probabilities to model the sequence of hidden states that most likely produced the observable text.

HMMs require training data to be able to predict the parts of speech in a sentence.  This is also their drawback in that they require manually labeled training data to be able to estimate probabilities.  

Support Vector Machines

The last “old school” NLP technique we will discuss is Support Vector Machines (SVM).  They were one of the first machine learning algorithms used for classification tasks.  SVMs were useful for tasks such as sentiment analysis and spam detection.  Much like machine learning tasks today, they worked by finding a hyperplane that separates different classes in high-dimensional spaces (yes this is hard to wrap your mind around, I would suggest reading the links to learn more).

Machine learning is based on vectors, or arrays of values.  SVMs are no exception.  They work by encoding the text document as a numerical vector such as TF-IDF values or others.  They then search to find an optimal plane / boundary that can completely separate multiple classes (for example, text that speaks positively about something versus negatively about it).  Kernel functions were used in cases where the plane that separated the classes was non-linear.  They would map the input space into another space that would allow the classes to be divided by a hyperplane.  This early machine learning method was very useful for classification tasks.

While they were good for classification tasks, they were not good at modeling sequences or structures in texts.  Outside of classification they did not really perform as well.  As is often the case with machine learning, large scale datasets were computationally expensive to train.

Conclusion

This basically sums up some of the classification methods for NLP.  You can see how things developed from statistical analysis to the beginnings of machine learning.  These methods were good at some things, but failed to capture the complexity and context of languages.  They heavily relied on preprocessing steps such as tokenization, common and stop word removals, manual labeling, and so on.  These steps could be time consuming and still not capture meaning.

The move to more machine learning methods would finally enable better handling of ambiguity and context in various languages.  Next time we will close with modern techniques at NLP and how they have created a revolution in processing human languages.