Question

I am trying to find the fastest way to search for a substring in a Text string. Here is the desired output:

findSubstringIndices :: Text -> Text -> [Int]
findSubstringIndices "asdfasdf" "as" == [0, 4]  -- 0-indexed
findSubstringIndices "asdasdasdasd" "asdasd" == [0, 3, 6]  -- matches can overlap

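In my application, the substring is a fixed 6-letter word, but the string to be searched is very long (say, over 3 billion letters). My current approach uses the KMP package: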

import qualified Data.Text.Lazy as T
import qualified Data.Algorithms.KMP as KMP

-- a is the string to search, b is the substring to find
findSubstringIndices a b = KMP.match (KMP.build $ T.unpack b) $ T.unpack a

But this seems like a huge waste of Text's compactness. Is there a (preferably concise) way to do this without unpacking?

I know that Text has a function called breakOnAll, but it does not meet my requirement of allowing overlapping matches.
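To illustrate that limitation, here is a minimal sketch using strict Data.Text (the lazy module has a breakOnAll as well). The helper name nonOverlappingIndices is made up for this example. Each pair returned by breakOnAll holds the whole prefix before a match, so the prefix length is the match offset.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Hypothetical helper: offsets of the matches reported by breakOnAll.
-- Arguments: string to search first, substring second (as in the question).
nonOverlappingIndices :: T.Text -> T.Text -> [Int]
nonOverlappingIndices haystack needle =
  -- fst of each pair is everything before that match, so its length
  -- is the 0-based index at which the match starts.
  map (T.length . fst) (T.breakOnAll needle haystack)
```

For "asdasdasdasd" and "asdasd" this should report only offsets 0 and 6, because the search resumes after the end of each match and so skips the overlapping occurrence at 3.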

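Edit: Following @ReidBarton's suggestion, I implemented a version that does not need unpack, and it is indeed faster. However, I am not sure it is the fastest.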

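-- t is the absolute offset of a within the original string.
-- The "" pattern requires the OverloadedStrings extension.
findSubstringIndicesC t a b = let (l, r) = T.breakOn b a in case r of
    "" -> []
    _  -> let i = t + fromIntegral (T.length l)  -- absolute offset of this match
          in i : findSubstringIndicesC (i + 1) (T.tail r) b

findSubstringIndices = findSubstringIndicesC 0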

Answer 1

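The introductory text of Data.ByteString.Search says that Boyer-Moore is usually the fastest, links to a DFA-based algorithm that is better in some special cases, and gives approximate performance ratios.

You should not use Text to represent DNA sequences. Text is meant for natural-language, possibly multilingual, text. DNA sequences look nothing like that.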
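As a rough sketch of what this suggestion could look like, assuming the stringsearch package (which provides Data.ByteString.Search) is installed and the sequence is held as a strict ByteString of single-byte characters: the wrapper below keeps the question's name and argument order, while indices itself takes the pattern first.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Search as Search  -- from the stringsearch package

-- Offsets of all occurrences of the pattern, overlapping ones included.
-- Arguments follow the question: string to search first, pattern second.
findSubstringIndices :: BC.ByteString -> BC.ByteString -> [Int]
findSubstringIndices target pat = Search.indices pat target
```

With the question's inputs, findSubstringIndices "asdasdasdasd" "asdasd" should give [0,3,6]. For input too large to hold in memory at once, the same package also has Data.ByteString.Lazy.Search, whose indices takes a lazy target and returns Int64 offsets. Whether Boyer-Moore actually beats the breakOn loop for a 6-letter pattern is something to measure rather than assume.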

How can I quickly search for a substring in a "Text"?
