Faster pattern matching with character classes using prime number encoding

作者:

Highlights:

摘要

In pattern matching with character classes the goal is to find all occurrences of a pattern of length m in a text of length n, where each pattern position consists of an allowed set of characters from a finite alphabet Σ. We present an FFT-based algorithm that uses a novel prime-numbers encoding scheme, which is logn/logm times faster than the fastest extant approaches, which are based on boolean convolutions. In particular, if m|Σ|=nO(1), our algorithm runs in time O(nlogm), matching the complexity of the fastest techniques for wildcard matching, a special case of our problem. A major advantage of our algorithm is that it allows a tradeoff between the running time and the RAM word size. Our algorithm also speeds up solutions to approximate matching with character classes problems—namely, matching with k mismatches and Hamming distance, as well as to the subset matching problem.

论文关键词:Pattern matching with character classes,Subset matching,Hamming distance,FFT-based pattern matching

论文评审过程:Received 30 September 2007, Revised 7 August 2008, Available online 22 August 2008.

论文官网地址:https://doi.org/10.1016/j.jcss.2008.08.005