written 5.7 years ago by |
String matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text
Why do we need string matching? String matching is used in almost all the software applications straddling from simple text editors to the complex NIDS
There are many types of String Matching Algorithms like:-
1) The Naive string-matching algorithm
2) The Rabin-Krap algorithm
3) String matching with finite automata
4) The Knuth-Morris-Pratt algorithm But we discuss about 2 types of string matching algorithms.
i. The Naive string-matching algorithm
ii. The Rabin-Krap algorithm
STRING MATCHING
- To find all occurrences of a pattern in a given text. We can formalize the above statement by saying:
Find a given pattern p[1..m] in text
T[1..n] with n>=m.
Given a pattern P[1..m] and a text T[1..n], find all occurrences of P in T. Both P and T belong to å*.
- P occurs with shift s (beginning at s+1): P[1]=T[s+1],P[2]=T[s+2],…,P[m]=T[s+m]. If so, call s is a valid shift, otherwise, an invalid shift.
Note: one occurrence begins within another one: P=abab, T=abcabababbc, P occurs at s=3 and s=5.
*text is the string that we are searching.
*pattern is the string that we are searching for.
*Shift is an offset into a string.
BASIC CLASSIFICATION
1. Naïve algorithm:
The naive approach for solving the string searching problem is accomplished by performing a Brute-Force comparison of each character in the pattern at each possible placement of the pattern in the string. This algorithm is O(mn) in the worst case.
2. Rabin – Karp algorithm:
String matching algorithm that compares string’s hash values, rather than string themselves. Performs well in practice, and generalized to other algorithm for related problems, such as two dimensional pattern matching.
3. Knuth-Morris-Pratt algorithm:
It is improved on the Brute-force algorithm and the new algorithm is capable of running O(m+n) in the worst case. This algorithm improves the running time by taking advantage of tagged borders.
4. Boyer-Moore algorithm:
The idea behind the Boyer-Moore algorithm is information gain. Here information is gained by beginning the comparison from the end of the pattern instead of the beginning. It performs the string searching task in sub linear time in the average case, which even KMP algorithm could not accomplish at that time.
THE PROBLEM OF STRING MATCHING
Given a string ‘S’, the problem of string matching deals with finding whether a pattern ‘p’ occurs in ‘S’ and if ‘p’ does occur then returning position in ‘S’ where ‘p’ occurs. O(mn) aproach
One of the most obvious approach towards the string matching problem would be to compare the first element of the pattern to be searched ‘p’, with the first element of the string ‘S’ in which to locate ‘p’.
If the first element of ‘p’ matches he first element of ‘S’, compare the second element of ‘p’ with second element of ‘S’. If match found proceed
likewise until entire ‘p’ is found. If a mismatch is found at any position, shift ‘p’ one position to the right and repeat comparison beginning from first element of ‘p’.
HOW DOES THE O(MN) APPROACH WORK Below is an illustration of how the previously described O(mn) approach works.
String S a b c a b a a b c a b a c
Pattern P a b a a
Step 1: compare p[1] with S[1]
S a b c a b a a b c a b a c
$\uparrow$
P a b a a
Step 2: compare p[2] with S[2]
S a b c a b a
$\uparrow$
P a b a a
Step 3: compare p[3] with S[3]
S a b c a b a a b c a b a c
$\uparrow$
P a b a a
Mismatch occurs here
Since mismatch is detected, shift ‘p’ one position to the left and perform steps analogous to those from step 1 to step 3. At position where mismatch is detected, shift ‘p’ one position to the right and repeat matching procedure.
Finally, a match would be found after shifting ‘p’ three times to the right side.