Data Science

KMP Algorithm | Knuth Morris Pratt Algorithm

Last Updated: 17th April, 2024

Arunav Goswami

Data Science Consultant at almaBetter

Explore the intricacies of the KMP algorithm. Discover its efficient string matching process, practical applications, and significant benefits.

In the realm of computer science, the Knuth-Morris-Pratt (KMP) algorithm stands as a testament to the evolution of string matching techniques. This powerful algorithm, named after its inventors Donald Knuth, Vaughan Pratt, and James H. Morris, revolutionized the way we approach string matching, making it more efficient and time-saving. The KMP algorithm is especially beneficial in fields such as data retrieval, DNA sequence analysis, and text editing tools. In this comprehensive guide, we will delve deep into the workings of the KMP algorithm, its applications, and the advantages it offers over traditional search methods.

Introduction to String Matching

Before understanding the KMP algorithm, it is essential to grasp the concept of string matching. String matching is a fundamental problem in computer science where one seeks to find the occurrence(s) of a substring (pattern) within a main string (text). The brute-force approach to this problem involves checking every position in the text for a potential match of the pattern, leading to a time-consuming process, especially with longer texts and patterns.

The Genesis of the KMP Algorithm

The KMP algorithm was developed to enhance the efficiency of the string matching process. It eliminates the need for backtracking by pre-processing the pattern to determine matching segments. This pre-processing step creates a partial match table (also known as the "failure function") that guides the algorithm in skipping over positions in the text where a match is impossible, thereby reducing the number of comparisons needed.

What is KMP algorithm?

The Knuth Morris Pratt algorithm is a string matching algorithm that searches for occurrences of a "word" W within a main "text string" S in O(n+m) time, where n is the length of S and m is the length of W. The crux of the KMP algorithm lies in its ability to memorize the matches of the pattern within the text.This efficiency is achieved by precomputing a table of how far the search position should jump ahead when a mismatch occurs. The key idea is to avoid redundant checking of characters in S that have already been matched against W.

The main steps of the KMP algorithm are:

Preprocessing Phase: Before the actual search begins, the algorithm preprocesses W to create a partial match table (also known as the "failure function" or "prefix table"). This table is used to determine how far to skip ahead in the string S after a mismatch. The preprocessing focuses on finding the longest proper prefix which is also a suffix in the subpatterns of W.
Searching Phase: The algorithm then begins to search for W in S by matching characters from the beginning. When a mismatch occurs, the partial match table is used to find out how many characters can be safely skipped. Instead of starting the search anew from the next character, the algorithm uses the information in the partial match table to skip non-viable positions, thereby reducing the number of comparisons needed.

The advantage of KMP over simpler string matching algorithms is its ability to complete the search in O(n+m) time, which means the kmp algorithm time complexity does not depend on the number of character comparisons directly but rather on the length of the text and the word being searched. This makes KMP especially useful for searching in large texts or for applications where the same word is searched many times in different contexts.

KMP Algorithm Example

Let's go through an example of how the Knuth-Morris-Pratt algorithm works by searching for the word "ABCDABD" within the text "ABC ABCDAB ABCDABCDABDE". The key to understanding the KMP algorithm is to grasp how the partial match table (also known as the "failure function") is constructed and then used to optimize the search by skipping unnecessary comparisons.

Step 1: Constructing the Partial Match Table

The partial match table tells us, for each position in the pattern (word) "ABCDABD", how far back we should jump if a mismatch happens at that position. The value at each position i in the table is the length of the longest proper prefix of the substring (ending at position i) which is also a suffix of this substring.

For "ABCDABD":

At position 0 (A), there are no proper prefixes, so the value is 0.
At position 1 (B), the only proper prefix is "A" which is not a suffix, so the value is 0.
At position 2 (C), the proper prefixes are "A", "AB", and none of them are suffixes, so the value is 0.
At position 3 (D), similar to above, the value is 0.
At position 4 (A), the proper prefixes include "A", "AB", "ABC", "ABCD", and "A" is a suffix of "ABCD", so the value is 1.
At position 5 (B), the proper prefixes include "AB", which is a suffix of "ABCDA", so the value is 2.
At position 6 (D), there are no matching suffixes in "ABCDAB", so the value is 0.

Thus, the partial match table is: 0, 0, 0, 0, 1, 2, 0.

Step 2: The Search Process

Start with the text "ABC ABCDAB ABCDABCDABDE" and the pattern "ABCDABD".
The search begins at the start of the text. "ABC" matches "ABC", but " " (space) does not match "D".
According to the partial match table, after a mismatch at "D" (position 3 in the pattern), we don't need to go back completely, but since all the values till "D" are 0, we just move the pattern one step forward.
We continue this process, moving the pattern along the text, utilizing the partial match table to determine how much to slide the pattern over the text each time a mismatch occurs.
When we reach the position in the text "ABC ABCDAB ABCDABCDABDE" that aligns with "ABCDABD", we've found a match starting at position 15 in the text (counting from 0).

This simplified example illustrates the power of the KMP algorithm: by using the partial match table to intelligently skip sections of the text that don't need to be checked again, it significantly reduces the number of character comparisons needed to find a match, leading to more efficient search operations, especially with larger texts and patterns.

KMP Algorithm Code

Below is a Python implementation of the KMP algorithm, which includes both the creation of the partial match table and the search process. This code will search for a word within a text and return the starting index of the first occurrence of the word within the text. If the word is not found, it returns -1.

def kmp_search(text, pattern):
    """Searches for the pattern in the text using the KMP algorithm."""
    # Part 1: Create the partial match table
    partial_match_table = [0] * len(pattern)
    j = 0  # length of the previous longest prefix suffix
    
    # Calculate partial_match_table[1:] (note: the first entry is always 0)
    for i in range(1, len(pattern)):
        # Update j as long as there's a mismatch
        while j > 0 and pattern[j] != pattern[i]:
            j = partial_match_table[j-1]
        
        # If there's a match, increment j and update the table
        if pattern[j] == pattern[i]:
            j += 1
        partial_match_table[i] = j

    # Part 2: Search for the pattern in the text using the partial match table
    i = j = 0  # index for text[], index for pattern[]
    while i < len(text):
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == len(pattern):
            return i - j  # Match found; return start index of the match in text
        # Mismatch after j matches
        elif i < len(text) and pattern[j] != text[i]:
            # Do not match partial_match_table[0..partial_match_table[j-1]] characters
            # they will match anyway
            if j != 0:
                j = partial_match_table[j-1]
            else:
                i += 1
    return -1  # No match found

text = "ABC ABCDAB ABCDABCDABDE"
pattern = "ABCDABD"
match_index = kmp_search(text, pattern)
print("Pattern found at index:", match_index)

This code defines a kmp_search function that first constructs the partial match table for the given pattern. It then uses this table to efficiently search through the text. When it finds a match, it returns the starting index of that match; otherwise, it returns -1.

Practical Applications of the KMP Algorithm

The KMP algorithm's efficiency and effectiveness make it invaluable in various practical applications. Some of the notable ones include:

Data Retrieval Systems: It speeds up the search for specific strings within large databases.
Text Editing Software: Enables quick search-and-replace functions.
Bioinformatics: Assists in DNA sequencing by finding specific sequences within larger genomic data.
Network Security: Helps in pattern matching for intrusion detection systems.

Conclusion

The Knuth-Morris-Pratt algorithm is a cornerstone in the field of computer science, particularly in the domain of string matching. Its innovative approach to eliminating inefficient backtracking has paved the way for faster and more reliable search techniques. As we continue to delve into the depths of data and require more efficient ways to navigate through information, the KMP algorithm remains a critical tool in our arsenal, underscoring the importance of algorithmic efficiency in the digital age.

If you're interested in learning more about the KMP algorithm, consider exploring our Python tutorial or enrolling in our data science online course!