# 高级算法：模式搜索的KMP算法详细实现

2021年4月15日17:23:04 发表评论 773 次浏览

## 本文概述

``````Input:  txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

pat[] =  "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12``````

KMP(Knuth Morris Pratt)模式搜索

``````txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"

txt[] = "ABABABCABABABCABABABC"
pat[] =  "ABABAC" (not a worst case, but a bad case for Naive)``````

KMP匹配算法使用模式的退化属性(具有相同子模式的模式在模式中出现多次), 并将最坏情况下的复杂度提高到O(n)。 KMP算法的基本思想是：每当检测到不匹配(在某些匹配之后)时, 我们就已经知道下一个窗口文本中的某些字符。我们利用这些信息来避免匹配我们知道会匹配的字符。让我们考虑下面的例子来理解这一点。

``````Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"

We compare first window of txt with pat
txt = "AAAAABAAABA"
pat = "AAAA"  [Initial position]
We find a match. This is same as Naive String Matching.

In the next step, we compare next window of txt with pat.
txt = "AAAAABAAABA"
pat =  "AAAA" [Pattern shifted one position]
This is where KMP does optimization over Naive. In this
second window, we only compare fourth A of pattern
with fourth character of current window of text to decide
whether current window matches or not. Since we know
first three characters will anyway match, we skipped
matching first three characters.

Need of Preprocessing?
An important question arises from the above explanation, how to know how many characters to be skipped. To know this, we pre-process pattern and prepare an integer array
lps[] that tells us the count of characters to be skipped.``````

KMP算法对pat []进行预处理, 并构造一个大小为m(与模式大小相同)的辅助lps [], 用于在匹配时跳过字符。

``````Examples of lps[] construction:
For the pattern "AAAA", lps[] is [0, 1, 2, 3]

For the pattern "ABCDE", lps[] is [0, 0, 0, 0, 0]

For the pattern "AABAACAABAA", lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

For the pattern "AAACAAAAAC", lps[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]

For the pattern "AAABAAA", lps[] is [0, 1, 2, 0, 1, 2, 3]``````

, 在此模式下, 我们将模式滑动一个并在每次移位时比较所有字符, 我们使用lps []中的值来确定要匹配的下一个字符。这个想法是不匹配我们知道会匹配的字符。

• 我们开始将j = 0的pat [j]与当前文本窗口的字符进行比较。
• 我们保持匹配字符txt [i]和pat [j], 并保持i和j递增, 而pat [j]和txt [i]保持匹配.
• 当我们看到一个不匹配
• 我们知道字符pat [0..j-1]与txt [i-j…i-1]相匹配(请注意, j以0开头, 仅在存在匹配项时递增)。
• 从上面的定义中我们还知道lps [j-1]是pat [0…j-1]的字符计数, 它们都是正确的前缀和后缀。
• 从以上两点可以得出结论, 我们不需要将这些lps [j-1]字符与txt [i-j…i-1]匹配, 因为我们知道这些字符仍然可以匹配。让我们考虑上面的例子来理解这一点。
``````txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}

i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
pat[i] and pat[j] match, do i++, j++

i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 4, j = 4
Since j == M, print pattern found and reset j, j = lps[j-1] = lps[3] = 3

Here unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 4, j = 3
txt[] = "AAAAABAAABA"
pat[] =  "AAAA"
txt[i] and pat[j] match, do i++, j++

i = 5, j = 4
Since j == M, print pattern found and reset j, j = lps[j-1] = lps[3] = 3

Again unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] =   "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[2] = 2

i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] =    "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[1] = 1

i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] =     "AAAA"
txt[i] and pat[j] do NOT match and j> 0, change only j
j = lps[j-1] = lps[0] = 0

i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] =      "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.

i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] =       "AAAA"
txt[i] and pat[j] match, do i++ and j++

We continue this way...``````

## C ++

``````//C++ program for implementation of KMP pattern searching
//algorithm
#include <bits/stdc++.h>

void computeLPSArray( char * pat, int M, int * lps);

//Prints occurrences of txt[] in pat[]
void KMPSearch( char * pat, char * txt)
{
int M = strlen (pat);
int N = strlen (txt);

//create lps[] that will hold the longest prefix suffix
//values for pattern
int lps[M];

//Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);

int i = 0; //index for txt[]
int j = 0; //index for pat[]
while (i <N) {
if (pat[j] == txt[i]) {
j++;
i++;
}

if (j == M) {
printf ( "Found pattern at index %d " , i - j);
j = lps[j - 1];
}

//mismatch after j matches
else if (i <N && pat[j] != txt[i]) {
//Do not match lps[0..lps[j-1]] characters, //they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
}

//Fills lps[] for given patttern pat[0..M-1]
void computeLPSArray( char * pat, int M, int * lps)
{
//length of the previous longest prefix suffix
int len = 0;

lps[0] = 0; //lps[0] is always 0

//the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i <M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else //(pat[i] != pat[len])
{
//This is tricky. Consider the example.
//AAACAAAA and i = 7. The idea is similar
//to search step.
if (len != 0) {
len = lps[len - 1];

//Also, note that we do not increment
//i here
}
else //if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}

//Driver program to test above function
int main()
{
char txt[] = "ABABDABACDABABCABAB" ;
char pat[] = "ABABCABAB" ;
KMPSearch(pat, txt);
return 0;
}``````

## Java

``````//JAVA program for implementation of KMP pattern
//searching algorithm

class KMP_String_Matching {
void KMPSearch(String pat, String txt)
{
int M = pat.length();
int N = txt.length();

//create lps[] that will hold the longest
//prefix suffix values for pattern
int lps[] = new int [M];
int j = 0 ; //index for pat[]

//Preprocess the pattern (calculate lps[]
//array)
computeLPSArray(pat, M, lps);

int i = 0 ; //index for txt[]
while (i <N) {
if (pat.charAt(j) == txt.charAt(i)) {
j++;
i++;
}
if (j == M) {
System.out.println( "Found pattern "
+ "at index " + (i - j));
j = lps[j - 1 ];
}

//mismatch after j matches
else if (i <N && pat.charAt(j) != txt.charAt(i)) {
//Do not match lps[0..lps[j-1]] characters, //they will match anyway
if (j != 0 )
j = lps[j - 1 ];
else
i = i + 1 ;
}
}
}

void computeLPSArray(String pat, int M, int lps[])
{
//length of the previous longest prefix suffix
int len = 0 ;
int i = 1 ;
lps[ 0 ] = 0 ; //lps[0] is always 0

//the loop calculates lps[i] for i = 1 to M-1
while (i <M) {
if (pat.charAt(i) == pat.charAt(len)) {
len++;
lps[i] = len;
i++;
}
else //(pat[i] != pat[len])
{
//This is tricky. Consider the example.
//AAACAAAA and i = 7. The idea is similar
//to search step.
if (len != 0 ) {
len = lps[len - 1 ];

//Also, note that we do not increment
//i here
}
else //if (len == 0)
{
lps[i] = len;
i++;
}
}
}
}

//Driver program to test above function
public static void main(String args[])
{
String txt = "ABABDABACDABABCABAB" ;
String pat = "ABABCABAB" ;
new KMP_String_Matching().KMPSearch(pat, txt);
}
}
//This code has been contributed by Amit Khandelwal.``````

## python

``````# Python program for KMP Algorithm
def KMPSearch(pat, txt):
M = len (pat)
N = len (txt)

# create lps[] that will hold the longest prefix suffix
# values for pattern
lps = [ 0 ] * M
j = 0 # index for pat[]

# Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps)

i = 0 # index for txt[]
while i <N:
if pat[j] = = txt[i]:
i + = 1
j + = 1

if j = = M:
print ( "Found pattern at index " + str (i - j))
j = lps[j - 1 ]

# mismatch after j matches
elif i <N and pat[j] ! = txt[i]:
# Do not match lps[0..lps[j-1]] characters, # they will match anyway
if j ! = 0 :
j = lps[j - 1 ]
else :
i + = 1

def computeLPSArray(pat, M, lps):
len = 0 # length of the previous longest prefix suffix

lps[ 0 ] # lps[0] is always 0
i = 1

# the loop calculates lps[i] for i = 1 to M-1
while i <M:
if pat[i] = = pat[ len ]:
len + = 1
lps[i] = len
i + = 1
else :
# This is tricky. Consider the example.
# AAACAAAA and i = 7. The idea is similar
# to search step.
if len ! = 0 :
len = lps[ len - 1 ]

# Also, note that we do not increment i here
else :
lps[i] = 0
i + = 1

txt = "ABABDABACDABABCABAB"
pat = "ABABCABAB"
KMPSearch(pat, txt)

# This code is contributed by Bhavya Jain``````

## C#

``````//C# program for implementation of KMP pattern
//searching algorithm
using System;

class GFG {

void KMPSearch( string pat, string txt)
{
int M = pat.Length;
int N = txt.Length;

//create lps[] that will hold the longest
//prefix suffix values for pattern
int [] lps = new int [M];
int j = 0; //index for pat[]

//Preprocess the pattern (calculate lps[]
//array)
computeLPSArray(pat, M, lps);

int i = 0; //index for txt[]
while (i <N) {
if (pat[j] == txt[i]) {
j++;
i++;
}
if (j == M) {
Console.Write( "Found pattern "
+ "at index " + (i - j));
j = lps[j - 1];
}

//mismatch after j matches
else if (i <N && pat[j] != txt[i]) {
//Do not match lps[0..lps[j-1]] characters, //they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
}

void computeLPSArray( string pat, int M, int [] lps)
{
//length of the previous longest prefix suffix
int len = 0;
int i = 1;
lps[0] = 0; //lps[0] is always 0

//the loop calculates lps[i] for i = 1 to M-1
while (i <M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else //(pat[i] != pat[len])
{
//This is tricky. Consider the example.
//AAACAAAA and i = 7. The idea is similar
//to search step.
if (len != 0) {
len = lps[len - 1];

//Also, note that we do not increment
//i here
}
else //if (len == 0)
{
lps[i] = len;
i++;
}
}
}
}

//Driver program to test above function
public static void Main()
{
string txt = "ABABDABACDABABCABAB" ;
string pat = "ABABCABAB" ;
new GFG().KMPSearch(pat, txt);
}
}

//This code has been contributed by Amit Khandelwal.``````

## 的PHP

``````<?php
//PHP program for implementation of KMP pattern searching
//algorithm

//Prints occurrences of txt[] in pat[]
function KMPSearch( \$pat , \$txt )
{
\$M = strlen ( \$pat );
\$N = strlen ( \$txt );

//create lps[] that will hold the longest prefix suffix
//values for pattern
\$lps = array_fill (0, \$M , 0);

//Preprocess the pattern (calculate lps[] array)
computeLPSArray( \$pat , \$M , \$lps );

\$i = 0; //index for txt[]
\$j = 0; //index for pat[]
while ( \$i <\$N ) {
if ( \$pat [ \$j ] == \$txt [ \$i ]) {
\$j ++;
\$i ++;
}

if ( \$j == \$M ) {
printf( "Found pattern at index " .( \$i - \$j ));
\$j = \$lps [ \$j - 1];
}

//mismatch after j matches
else if ( \$i <\$N && \$pat [ \$j ] != \$txt [ \$i ]) {
//Do not match lps[0..lps[j-1]] characters, //they will match anyway
if ( \$j != 0)
\$j = \$lps [ \$j - 1];
else
\$i = \$i + 1;
}
}
}

//Fills lps[] for given patttern pat[0..M-1]
function computeLPSArray( \$pat , \$M , & \$lps )
{
//length of the previous longest prefix suffix
\$len = 0;

\$lps [0] = 0; //lps[0] is always 0

//the loop calculates lps[i] for i = 1 to M-1
\$i = 1;
while ( \$i <\$M ) {
if ( \$pat [ \$i ] == \$pat [ \$len ]) {
\$len ++;
\$lps [ \$i ] = \$len ;
\$i ++;
}
else //(pat[i] != pat[len])
{
//This is tricky. Consider the example.
//AAACAAAA and i = 7. The idea is similar
//to search step.
if ( \$len != 0) {
\$len = \$lps [ \$len - 1];

//Also, note that we do not increment
//i here
}
else //if (len == 0)
{
\$lps [ \$i ] = 0;
\$i ++;
}
}
}
}

//Driver program to test above function

\$txt = "ABABDABACDABABCABAB" ;
\$pat = "ABABCABAB" ;
KMPSearch( \$pat , \$txt );

//This code is contributed by chandan_jnu
?>``````

``Found pattern at index 10``

``````pat[] = "AAACAAAA"

len = 0, i  = 0.
lps[0] is always 0, we move
to i = 1

len = 0, i  = 1.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[1] = 1, i = 2

len = 1, i  = 2.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[2] = 2, i = 3

len = 2, i  = 3.
Since pat[len] and pat[i] do not match, and len> 0, set len = lps[len-1] = lps[1] = 1

len = 1, i  = 3.
Since pat[len] and pat[i] do not match and len> 0, len = lps[len-1] = lps[0] = 0

len = 0, i  = 3.
Since pat[len] and pat[i] do not match and len = 0, Set lps[3] = 0 and i = 4.
We know that characters pat
len = 0, i  = 4.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 1, lps[4] = 1, i = 5

len = 1, i  = 5.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 2, lps[5] = 2, i = 6

len = 2, i  = 6.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[6] = 3, i = 7

len = 3, i  = 7.
Since pat[len] and pat[i] do not match and len> 0, set len = lps[len-1] = lps[2] = 2

len = 2, i  = 7.
Since pat[len] and pat[i] match, do len++, store it in lps[i] and do i++.
len = 3, lps[7] = 3, i = 8

We stop here as we have constructed the whole lps[].``````