Question

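I have a &[u8] slice over a binary buffer. I need to parse it, but many of the methods I would like to use (such as str::find) do not seem to be available on slices.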

I have seen that I can convert both the buffer slice and my pattern to str using from_utf8_unchecked(), but that seems a little dangerous (and also really hacky).

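How can I find a subsequence in this slice? I actually need the index of the pattern, not just a slice view of the parts, so I don't think split will work.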

Answer 1

Here is a simple implementation based on the windows iterator.

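fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows` panics on a size of 0, so an empty needle matches at index 0.
    if needle.is_empty() { return Some(0); }
    haystack.windows(needle.len()).position(|window| window == needle)
}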

fn main() {
    assert_eq!(find_subsequence(b"qwertyuiop", b"tyu"), Some(4));
    assert_eq!(find_subsequence(b"qwertyuiop", b"asd"), None);
}

The find_subsequence function can also be made generic:

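fn find_subsequence<T>(haystack: &[T], needle: &[T]) -> Option<usize>
    where for<'a> &'a [T]: PartialEq
{
    // `windows` panics on a size of 0, so an empty needle matches at index 0.
    if needle.is_empty() { return Some(0); }
    haystack.windows(needle.len()).position(|window| window == needle)
}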
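Since the question asks for indices, the same windows approach can be extended to collect every match position instead of only the first. This is a minimal sketch; the name find_all_subsequences is hypothetical, and returning no matches for an empty needle is an assumption made here to avoid the panic that windows(0) would cause.

```rust
/// Returns the start index of every (possibly overlapping) occurrence
/// of `needle` in `haystack`. Hypothetical helper, not a std function.
fn find_all_subsequences(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows` panics on a size of 0; report no matches for an empty needle.
    if needle.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(needle.len())
        .enumerate()
        // Keep only the starting index of each window equal to the needle.
        .filter_map(|(i, window)| if window == needle { Some(i) } else { None })
        .collect()
}
```

For example, find_all_subsequences(b"abcabcab", b"ab") yields the indices 0, 3 and 6. Overlapping matches are reported too, because the window advances one element at a time.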

Answer 2

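I don't think the standard library contains this functionality. Some libcs have memmem, but at the moment the libc crate does not wrap it. However, you can use the twoway crate. rust-bio also implements some pattern matching algorithms. All of these should be faster than using haystack.windows(..).position(..).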
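To illustrate how a dedicated search algorithm can avoid checking every window, here is a std-only sketch of Boyer-Moore-Horspool. This is not the algorithm the twoway crate uses (that crate implements Two-Way); it is swapped in only because it is short. The name find_horspool is hypothetical. It usually skips several bytes per step, but its worst case is still proportional to haystack length times needle length.

```rust
/// Boyer-Moore-Horspool search over bytes: index of the first match, if any.
/// Hypothetical helper for illustration; not part of any crate named above.
fn find_horspool(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let m = needle.len();
    if m == 0 {
        return Some(0); // assumption: an empty needle matches at index 0
    }
    if m > haystack.len() {
        return None;
    }
    // shift[b] is how far the needle may slide when byte `b` is aligned with
    // its last position: the distance from the last occurrence of `b` in
    // needle[..m - 1] to the end of the needle, or `m` if `b` does not occur.
    let mut shift = [m; 256];
    for (i, &b) in needle[..m - 1].iter().enumerate() {
        shift[b as usize] = m - 1 - i;
    }
    let mut pos = 0;
    while pos + m <= haystack.len() {
        if &haystack[pos..pos + m] == needle {
            return Some(pos);
        }
        // Every table entry is at least 1, so the loop always advances.
        pos += shift[haystack[pos + m - 1] as usize];
    }
    None
}
```

With the inputs from Answer 1, find_horspool(b"qwertyuiop", b"tyu") inspects only the windows at positions 0, 3 and 4 before returning Some(4).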

Answer 3

How about Regex on bytes? It looks very powerful. See the Rust playground demo.

use regex::bytes::Regex;

// This shows how to find all null-terminated strings in a slice of bytes
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);

Answer 4

I found the memmem crate useful for this task:

use memmem::{Searcher, TwoWaySearcher};

let search = TwoWaySearcher::new("dog".as_bytes());
assert_eq!(
    search.search_in("The quick brown fox jumped over the lazy dog.".as_bytes()),
    Some(41)
);

How do I find a subsequence in a &[u8] slice?
