Explain different string matching algorithms.

137views

written 9.4 years ago by

teamques10 ★ 70k

The native method:

i. It is the simplest method which uses brute force approach.

ii. It is a straight forward approach of solving the problem.

iii. The algorithm performs a checking at all positions in the text and check whether an occurrence of the pattern starts there or not.

iv. After each attempt it shifts the pattern by exactly one position to the right.

v. If match is found then return otherwise the matching process is continued by shifting one character to the right.

vi. We have to perform n comparisons even if there is no match at all in text for the given pattern.

Example:

Consider the text and pattern given below.

0	1	2	3	4
H	U	M	A	N

TEXT

0	1	2
M	A	N

PATTERN

Now we will start finding match for pattern from 0th location in Text. If the match is not found then shift to the right by 1 position.

Step 1:

Since no match is found. Therefore shift to the right by 1 position.

Step 2:

No match is found.

Step 3:

Hence return index 2, because a match with the pattern is found from the location in text.

Rabin-Karp method:

i. Rabin-Karp method is based on hashing technique.

ii. This algorithm assumes each character to be a digit in radix-d notation.

iii. It makes use of equivalence of two numbers modulo a third number.

iv. The computation of pattern is done by using Horner’s rule as follows.

P = p [m] + d(p[m - 1] + d(p[m - 2] + ….d(p[2] + d(p[1]))))

Example:

Given T = 31415926535 and P = 26

We choose q = 11, P mod q = 26 mod 11 = 4

enter image description here

53 mod 11 = 9 not equal to 4

Now, compute ts+1

$t_{s+1} = (d(t_s − d^{m−1} T[s + 1]) + T[s + m + 1]) \\ \hspace{0.7cm}= (d(ts − d m−1. Old high order digit) + new low order digit \\ \hspace{0.7cm}= 10 (26 – 10 x 2) + 5 \\ \hspace{0.7cm}= 60 + 5 \\ \hspace{0.7cm}= 65$

Therefore, we have found the exact match against the text at 7th position.

Knuth-Morris-Pratt method:

i. This algorithm is named after scientist Knuth, Morris and Pratt.

ii. The basic idea behind this algorithm is to build a prefix array.

iii. This array is also called as Π array.

iv. Prefix array is build using the prefix and suffix information of pattern.

v. This algorithm achieves the efficiency of O (m +n) which is optimal in worst case.

Example:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Text	b	a	d	B	a	b	a	b	a	b	a	d	A	a	b
Pattern	a	b	a	B	a	d	a

Compute Prefix table:

0	1	2	3	4	5	6
A	b	a	b	a	D	A
0	0	1	2	3	0	0

Text [4] = Pattern [0] so increment.

Text [5] = Pattern [1] so increment.

Text [6] = Pattern [2] so increment.

Text [7] = Pattern [3] so increment.

Text [8] = Pattern [4] so increment.

Now Text [9] ≠ Pattern [5].

Now we backtrack, therefore we will be in position of Pattern [4]. Consult prefix table [4] = 3.

So we will compare Text [9] with Pattern [3].

Text [9] = Pattern [3] so increment.

Text [10] = Pattern [4] so increment.

Text [11] = Pattern [5] so increment.

Text [12] = Pattern [6] so increment.

Since entire pattern is matching with given text array at 6th position.

The pseudo code for KMP matcher algorithm:

Boyer – Moore Algorithm:

The algorithm works with two basic ideas.

i. Comparing the pattern with text from right to left.

ii. The shift size can be computed using two tables:

Bad symbol table.
Good suffix table.

Algorithm:

i. Construct a bad symbol table. Then obtain bad symbol shift by the following formula.

D1 = max {t(T) – k, 1}

ii. Construct a good suffix shift table D2.

iii. Place the pattern from begging of text and start scanning the pattern from right to left.

iv. Repeat steps until matching substring is not found or pattern goes beyond last character of text.

Example:

Text	B	H	U	T	T	O		K	N	O	W	S		T	O	R	O	N	T	O
Pattern	T	O	R	O	N	T	O

Construct bad symbol table:

6	5	4	3	2	1
T	O	R	O	N	T	O

All the remaining alphabet’s entry = length of pattern = 7

Construct good suffix table:

As in text, space character is encountered the shift of 7 position can be made. .

Text	B	H	U	T	T	O		K	N	O	W	S		T	O	R	O	N	T	O
Pattern								T	O	R	O	N	T	O

As T from text ≠ 0 from pattern. T entry in bad symbol is 1. That means shift pattern by 1 position to right.

enter image description here

The blank space from text is mismatched with ‘N’ from pattern.

D1 = MAX {(text (-) – k), 1} = MAX {(7 - 2), 1} = 5

Now we will obtain good suffix shift D2. As 2 characters from suffix of the pattern are matching D2 (2) = 5.

The shift size D = MAX {D1, D2} = {5, 5} = 5

Therefore shift the pattern by 5 positions ahead.

Text	B	H	U	T	T	O		K	N	O	W	S		T	O	R	O	N	T	O
Pattern														T	O	R	O	N	T	O

Hence all the characters from pattern are matching with the text’s character.

ADD COMMENT EDIT