Question

I'm currently porting a library from Python to Rust and have hit a line that I can't find the right "translation" for:

right = s.index(sep, left)

where right is the index of the first instance of sep found in the string s at or after index left.

Here is a simple example:

Python 3

>>> s = "Hello, my name is erip and my favorite color is green."
>>> right = s.index("my", 10) # Find index of first instance of 'my' after index 10
>>> print(right)
27
>>> print(s[27:])
my favorite color is green.

My attempt in Rust is:

// s: &str, sep: &str, left: usize
let right = s[left..].find(sep).unwrap() + left;

This searches for sep in the bytes after left. It seems to work with ASCII characters. With Unicode, however, there appears to be a problem:

Python 3

>>> s = "Hello, mÿ name is erip and mÿ favorite color is green."
>>> right = s.index("mÿ", 10)
>>> print(right)
27

Rust

fn main() {
    let sep: &str = "mÿ";
    let left: usize = 10;
    let s: &str = "Hello, mÿ name is erip and mÿ favorite color is green.";
    let right = s[left..].find(sep).unwrap() + left;
    println!("{}", right); //prints 28
}

I realize that Python 2 would also give 28, since it doesn't natively support Unicode, but I would like to mimic the Python 3 result.

I think the problem is that a usize index in Rust counts bytes in the string, and "mÿ" actually takes 3 bytes to encode. How can I get the desired behavior in Rust?

I'm using rustc 1.4.0.

Answer 1

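Let's restate the problem a bit, because it isn't clear what the units of index are. Humans believe strings are easy because we use them all the time. However, things aren't as easy as we think.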

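Rust considers strings (&str or String) to be UTF-8 encoded sequences of bytes. Jumping into a string using a byte offset is O(1), and you really want that level of performance guarantee in order to build more complex things on top.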
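To make the byte-oriented view concrete, here is a minimal sketch using only the standard library. The helper name `byte_and_char_len` is made up for this illustration; `len` reports bytes, while `chars().count()` reports codepoints.

```rust
// Hypothetical helper for illustration: returns (byte length, codepoint count).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'm' is 1 byte in UTF-8, 'ÿ' (U+00FF) is 2 bytes.
    let (bytes, chars) = byte_and_char_len("mÿ");
    println!("{} bytes, {} codepoints", bytes, chars);

    // Slicing uses byte offsets and is O(1). The offset must fall on a
    // character boundary, otherwise the slice panics at runtime.
    let s = "Hello, mÿ name";
    println!("{:?}", &s[10..]);
}
```

This is why `left = 10` in the question lands one character earlier in Rust than it does in Python 3: byte 10 is the space after "mÿ", not the "n" that follows it.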

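I have no idea what Python considers that index to be. Once you get beyond simple encoding schemes (like ASCII, where one character is one byte), it gets hard. There are multiple ways to chunk a Unicode string, depending on what you need. Two obvious ones are by Unicode codepoint and by grapheme.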
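The difference between the two can be seen with the standard library alone. The sketch below uses a made-up helper, `codepoints`, and writes "ÿ" in two ways that render as the same single grapheme. Note that the standard library does not segment graphemes, so only the codepoint and byte counts are shown; Python 3 indexes `str` by codepoint, so it would also see these two strings differently.

```rust
// Hypothetical helper for illustration: counts Unicode codepoints (`char`s).
fn codepoints(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{00FF}"; // 'ÿ' as a single codepoint
    let decomposed = "y\u{0308}"; // 'y' followed by COMBINING DIAERESIS

    // Same visible character, different codepoint counts.
    println!("{} vs {}", codepoints(precomposed), codepoints(decomposed));
    // And different byte lengths in UTF-8.
    println!("{} vs {}", precomposed.len(), decomposed.len());
}
```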

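Since codepoints can be represented in Rust using char, that's what I'm going to go with. However, you are the only one who can answer that question.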

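Also, since you request the result to be 28, that must be the number of bytes into the string. It's a bit strange to skip N codepoints but return bytes, but it is what it is.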

~~Now that we know what we are doing~~ ... let's try it. (See the second solution further down, written after I read the desired result more carefully.)

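The key thing you need to use is char_indices. This is an O(n) operation that walks the string and gives you each codepoint along with its corresponding byte offset.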
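As a quick illustration of what `char_indices` yields, here is a small sketch on a short string. The helper `offsets` is hypothetical and exists only to make the pairs easy to inspect.

```rust
// Hypothetical helper for illustration: collects (byte offset, codepoint) pairs.
fn offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    // 'ÿ' occupies bytes 1 and 2, so the next codepoint starts at byte 3.
    for (byte_idx, c) in offsets("mÿ n") {
        println!("{} -> {:?}", byte_idx, c);
    }

    // nth(k) skips k codepoints and yields the byte offset of the next one.
    // It returns None when k runs past the end of the string.
    println!("{:?}", "mÿ n".char_indices().nth(2));
    println!("{:?}", "mÿ n".char_indices().nth(4));
}
```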

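Then it's just a matter of combining these pieces and properly handling the case of walking off the end of the string. Rust's strong types are a win here: returning an Option makes that case explicit.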

// `index` is the number of Unicode codepoints to skip.
// The result is the **byte** offset within the haystack
// at which the needle is found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
    haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
        let leftover = &haystack[byte_idx..];
        leftover.find(needle).map(|inner_idx| inner_idx + byte_idx)
    })
}

fn main() {
    let right = python_index("Hello, mÿ name is erip and mÿ favorite color is green.", "mÿ", 10);
    println!("{:?}", right); // prints Some(28)
}

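To return a codepoint index instead (27, matching Python 3), we use the same high-level concept as above, but once we find the needle, we start over and walk the codepoints again. When we reach the byte offset at which the substring was found, we stop.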

Then it's just a matter of counting the characters we saw and adding the number we had already skipped.

// `index` is the number of Unicode codepoints to skip.
// The result is the codepoint index within the haystack
// at which the needle is found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
    haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
        let leftover = &haystack[byte_idx..];

        leftover.find(needle).map(|inner_offset| {
            leftover.char_indices().take_while(|&(inner_inner_offset, _)| {
                inner_inner_offset != inner_offset
            }).count() + index
        })
    })
}

fn main() {
    let right = python_index("Hello, mÿ name is erip and mÿ favorite color is green.", "mÿ", 10);
    println!("{:?}", right); // prints Some(27)
}

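This certainly doesn't feel super efficient; you'd want to benchmark it to see how it fares. However, the find implementation is quite optimized, so I'd rather use it, then walk straight through the characters and trust the cache and prefetching to help me ^_^.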
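A slightly shorter way to do the second pass, offered only as a sketch and not benchmarked, is to slice up to the byte offset that `find` returned and count the codepoints in that prefix. The name `python_index_chars` and the parameter name `start` are made up for this example; the behavior is intended to match the codepoint-returning `python_index` above.

```rust
// Hypothetical variant of `python_index`: `start` is the number of
// codepoints to skip, and the result is a codepoint index.
fn python_index_chars(haystack: &str, needle: &str, start: usize) -> Option<usize> {
    // Byte offset of the `start`-th codepoint; None if `start` is past the end.
    haystack.char_indices().nth(start).and_then(|(byte_idx, _)| {
        let leftover = &haystack[byte_idx..];
        leftover.find(needle).map(|inner_offset| {
            // `find` returns an offset on a character boundary,
            // so slicing the prefix here cannot panic.
            leftover[..inner_offset].chars().count() + start
        })
    })
}

fn main() {
    let s = "Hello, mÿ name is erip and mÿ favorite color is green.";
    println!("{:?}", python_index_chars(s, "mÿ", 10));
    println!("{:?}", python_index_chars(s, "zz", 10));
}
```

Where Python's `index` raises `ValueError` for a missing substring, this sketch returns `None`, leaving the caller to decide whether to `unwrap` or handle the miss.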

Emulating Python's `index(separator, start_index)` in Rust
