String Matching Algorithm

Last Updated on June 29, 2023 by Mayank Dham

String matching algorithms are fundamental tools in computer science and are widely used in various applications such as text processing, data mining, information retrieval, and pattern recognition. These algorithms aim to locate occurrences of a pattern within a larger text or string. In this article, we will delve into different string matching algorithms, starting with the simple brute force method and progressing towards more advanced techniques.

Understanding String Matching Algorithm

Given a text array, T [1…..n], of n character and a pattern array, P [1……m], of m characters. The problems are to find an integer s, called a valid shift where 0 ≤ s < n-m and T [s+1……s+m] = P [1……m]. In other words, to find even if P in T, i.e., where P is a substring of T. The items of P and T are characters drawn from some finite alphabet such as {0, 1} or {A, B …..Z, a, b….. z}.

The substrings of a string T [1……n] are represented as T [i……j] for some 0i jn-1, the string generated by the characters in T from index i to index j, inclusive. This method assumes that a string is a substring of itself (i = 0 and j = m). For some 0i jn-1, the correct substring of string T [1……n] is T [1……j]. That is, either i>0 or j m-1 must exist.

Using these descriptions, we can say given any string T [1……n], the substrings are

And proper substrings are

Types of String Matching Algorithms

1. Brute Force Method
The brute force approach is the simplest string matching algorithm. It involves comparing the pattern with every substring of the text until a match is found. This algorithm has a time complexity of O(mn), where ‘m’ is the length of the pattern and ‘n’ is the length of the text. Although straightforward, it becomes inefficient for large texts or patterns.

Brute Force Method

Python

def brute_force(text, pattern):
    n = len(text)
    m = len(pattern)

    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i

    return -1

Sample Description

2. Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves upon the brute force method by utilizing information from previous comparisons to avoid unnecessary character comparisons. It precomputes a prefix function that helps determine the number of characters to skip in the pattern whenever a mismatch occurs. This results in a time complexity of O(n + m) for string matching.

Knuth-Morris-Pratt (KMP) Algorithm

Python

def compute_prefix(pattern):
    m = len(pattern)
    prefix = [0] * m
    length = 0
    i = 1

    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            prefix[i] = length
            i += 1
        else:
            if length != 0:
                length = prefix[length - 1]
            else:
                prefix[i] = 0
                i += 1

    return prefix

def kmp(text, pattern):
    n = len(text)
    m = len(pattern)
    prefix = compute_prefix(pattern)

    i = j = 0
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
            if j == m:
                return i - j
        else:
            if j != 0:
                j = prefix[j - 1]
            else:
                i += 1

    return -1x

def compute_prefix(pattern): m = len(pattern) prefix = [0] * m length = 0 i = 1 while i < m: if pattern[i] == pattern[length]: length += 1 prefix[i] = length i += 1 else: if length != 0: length = prefix[length - 1] else: prefix[i] = 0 i += 1 return prefix def kmp(text, pattern): n = len(text) m = len(pattern) prefix = compute_prefix(pattern) i = j = 0 while i < n: if pattern[j] == text[i]: i += 1 j += 1 if j == m: return i - j else: if j != 0: j = prefix[j - 1] else: i += 1 return -1x

3. Boyer-Moore Algorithm
The Boyer-Moore algorithm is a powerful string matching algorithm that takes advantage of both the bad character rule and the good suffix rule. It examines the text from right to left and shifts the pattern more intelligently based on mismatched characters. This algorithm has an average time complexity of O(n/m) for string matching.

Boyer-Moore Algorithm

Python

def build_bad_character_table(pattern):
    m = len(pattern)
    table = {}
    
    for i in range(m - 1):
        table[pattern[i]] = i

    return table

def boyer_moore(text, pattern):
    n = len(text)
    m = len(pattern)
    table = build_bad_character_table(pattern)

    i = 0
    while i = 0 and pattern[j] == text[i + j]:
            j -= 1

        if j < 0:
            return i

        if text[i + j] in table:
            i += max(1, j - table[text

[i + j]])
        else:
            i += j + 1

    return -1

Sample Description

4. Rabin-Karp Algorithm
The Rabin-Karp algorithm is a popular string matching algorithm that utilizes hashing. It compares the hash values of the pattern and each substring of the text to determine potential matches. This algorithm has an average time complexity of O(n + m), where ‘n’ is the length of the text and ‘m’ is the length of the pattern.

Rabin-Karp Algorithm

Python

def rabin_karp(text, pattern):
    n = len(text)
    m = len(pattern)
    prime = 101  # A prime number for hash calculation
    pat_hash = 0
    txt_hash = 0
    h = 1

    # Calculate the hash value of the pattern and the first substring of the text
    for i in range(m-1):
        h = (h * 256) % prime
    for i in range(m):
        pat_hash = (256 * pat_hash + ord(pattern[i])) % prime
        txt_hash = (256 * txt_hash + ord(text[i])) % prime

    # Slide the pattern over the text and compare hash values
    for i in range(n - m + 1):
        if pat_hash == txt_hash:
            if pattern == text[i:i + m]:
                return i
        if i < n - m:
            txt_hash = (256 * (txt_hash - ord(text[i]) * h) + ord(text[i + m])) % prime
            if txt_hash < 0:
                txt_hash += prime

    return -1

Sample Description

5. DFA (Deterministic Finite Automaton) Method
The DFA method constructs a finite automaton that represents the pattern to be matched. By traversing the automaton for each character in the text, this algorithm efficiently identifies matches. It has a time complexity of O(n) for preprocessing the pattern and O(m) for matching, making it highly efficient for large texts.

DFA (Deterministic Finite Automaton) Method

Python

Sample Description

def build_dfa(pattern, alphabet):
    m = len(pattern)
    dfa = [[0] * m for _ in range(alphabet)]

    dfa[ord(pattern[0])][0] = 1
    x = 0

    for j in range(1, m):
        for c in range(alphabet):
            dfa[c][j] = dfa[c][x]
        dfa[ord(pattern[j])][j] = j + 1
        x = dfa[ord(pattern[j])][x]

    return dfa

def dfa_matcher(text, pattern):
    n = len(text)
    m = len(pattern)
    alphabet = 256  # Assuming ASCII characters
    dfa = build_dfa(pattern, alphabet)
    j = 0

    for i in range(n):
        j = dfa[ord(text[i])][j]
        if j == m:
            return i - m + 1

    return -1

6. Aho-Corasick Algorithm
The Aho-Corasick algorithm efficiently matches multiple patterns simultaneously. It constructs a trie data structure that allows for efficient pattern matching. This algorithm has a time complexity of O(n + z + m), where ‘n’ is the length of the text, ‘z’ is the total number of occurrences, and ‘m’ is the sum of the lengths of the patterns.

Aho-Corasick Algorithm

Python

class TrieNode:
    def __init__(self):
        self.children = [None] * 256
        self.is_end_of_word = False
        self.word = ""
        self.failure_link = None

def build_trie(patterns):
    root = TrieNode()

    for pattern in patterns:
        node = root
        for char in pattern:
            index = ord(char)
            if not node.children[index]:
                node.children[index] = TrieNode()
            node = node.children[index]
        node.is_end_of_word = True
        node.word = pattern

    return root

def build_failure_links(root):
    queue = []
    for child in root.children:
        if child:
            queue.append(child)
            child.failure_link = root

    while queue:
        current = queue.pop(0)
        for i in range(256):
            if current.children[i]:
                child = current.children[i]
                queue.append(child)
                failure_link = current.failure_link

                while failure_link and not failure_link.children[i]:
                    failure_link = failure_link.failure_link

                child.failure_link = failure_link.children[i] if failure_link else root

def aho_corasick(text, patterns):
    root = build_trie(patterns)
    build_failure_links(root)
    current = root
    matches = []

    for i, char in enumerate(text):
        index = ord(char)
        while current and not current.children[index]:
            current = current.failure_link

        if not current:
            current = root
            continue

        current = current.children[index]

        if current.is_end_of_word:
            matches.append(i - len(current.word) + 1)

    return matches

Sample Description

Applications of String Matching Algorithm

String matching algorithms find applications in various domains and industries. Here are some common applications of string matching algorithms:

1. Text Processing and Search Engines
String matching algorithms are extensively used in text processing tasks, such as searching for keywords or phrases within a large corpus of documents. Search engines employ these algorithms to match user queries with relevant web pages or documents, enabling efficient information retrieval.

2. DNA Sequence Matching
In bioinformatics and genetics, string matching algorithms play a vital role in DNA sequence analysis. These algorithms help identify patterns and similarities within DNA sequences, allowing scientists to study genetic variations, gene mutations, and evolutionary relationships.

3. Plagiarism Detection
String matching algorithms can be applied in plagiarism detection systems to compare text documents and identify instances of copied content. By comparing patterns and similarities between documents, these algorithms can effectively flag potential cases of plagiarism.

4. Data Mining and Pattern Recognition
String matching algorithms are used in data mining tasks to discover patterns and relationships within large datasets. They are employed in areas such as market analysis, customer behavior analysis, fraud detection, and recommendation systems.

5. Virus and Malware Detection
Anti-virus and anti-malware software utilize string matching algorithms to detect known patterns or signatures of malicious code. These algorithms efficiently compare file content against a database of known threats, allowing for the timely identification and removal of viruses and malware.

6. Natural Language Processing (NLP)
In natural language processing, string matching algorithms assist in tasks such as named entity recognition, sentiment analysis, and text classification. These algorithms help identify and match specific words, phrases, or patterns within textual data to extract meaningful information.

7. Spell Checking and Autocorrection
String matching algorithms are employed in spell checking and autocorrection systems to suggest or correct misspelled words. These algorithms compare input words against a dictionary or language model to identify potential matches and offer relevant suggestions for correction.

8. Network Intrusion Detection
String matching algorithms find application in network security systems for intrusion detection. They are used to match patterns or signatures of known attack patterns within network traffic data, enabling the identification and prevention of malicious activities.

Conclusion
String matching algorithms are essential tools for efficiently finding patterns within texts. While the brute force method serves as a basic starting point, advanced algorithms such as the KMP algorithm and Boyer-Moore algorithm provide significant improvements in terms of time complexity. Depending on the nature of the problem and the size of the input, choosing the appropriate string matching algorithm can greatly impact the performance of the overall application. By exploring these algorithms and understanding their underlying principles, developers can make informed decisions when it comes to implementing efficient string matching solutions in their applications.

Frequently Asked Questions (FAQs)

Q1. What is the difference between brute force and advanced string matching algorithms?
The brute force algorithm compares the pattern with every substring of the text, making it simple but inefficient for large inputs. Advanced algorithms like KMP, Boyer-Moore, and Aho-Corasick employ various techniques to optimize the matching process, such as precomputing information or utilizing automaton structures, resulting in improved efficiency and faster execution.

Q2. Can string matching algorithms handle case-insensitive searches?
Yes, string matching algorithms can be adapted to handle case-insensitive searches. By converting both the text and pattern to lowercase or uppercase before comparison, we can ensure that the algorithm identifies matches regardless of case sensitivity.

Q3. Are string matching algorithms limited to single-pattern searches?
No, string matching algorithms can handle both single-pattern and multiple-pattern searches. For single-pattern searches, algorithms like Rabin-Karp, KMP, and Boyer-Moore are commonly used. For multiple-pattern searches, algorithms like Aho-Corasick efficiently match multiple patterns simultaneously, making them ideal for applications like text processing and keyword extraction.

Q4. Do string matching algorithms work efficiently for large texts or datasets?
String matching algorithms, especially the advanced ones, are designed to handle large texts and datasets efficiently. Algorithms like Aho-Corasick and Boyer-Moore demonstrate good scalability and provide excellent performance even with extensive input data.

Q5. Can string matching algorithms be used for approximate or fuzzy matching?
While string matching algorithms primarily focus on exact pattern matching, variations and extensions exist to handle approximate or fuzzy matching. Techniques like Levenshtein distance, Edit distance, or using algorithms like Smith-Waterman or Needleman-Wunsch for sequence alignment can address approximate matching by allowing for slight variations or substitutions in the patterns.

String Matching Algorithm

Understanding String Matching Algorithm

Types of String Matching Algorithms

Brute Force Method

Knuth-Morris-Pratt (KMP) Algorithm

Boyer-Moore Algorithm

Rabin-Karp Algorithm

DFA (Deterministic Finite Automaton) Method

Aho-Corasick Algorithm

Applications of String Matching Algorithm

Frequently Asked Questions (FAQs)

Leave a Reply Cancel reply

Data Mining Tools

Issues in Data Mining

Classification of Data Mining Systems

Data Mining Functionalities

Different Types of Data in Data Mining

The Architecture of Data Mining

Sign in to your account

Login via OTP

Login via OTP

Register with PrepBytes

Understanding String Matching Algorithm

Types of String Matching Algorithms

Brute Force Method

Knuth-Morris-Pratt (KMP) Algorithm

Boyer-Moore Algorithm

Rabin-Karp Algorithm

DFA (Deterministic Finite Automaton) Method

Aho-Corasick Algorithm

Applications of String Matching Algorithm

Frequently Asked Questions (FAQs)

Leave a Reply Cancel reply