
Question

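I have been struggling with this problem for a month. I need to make a lot of replacements (more than ten million) in one big string (String^), and I need to do it fast. My approach produced correct results, but the program ran for more than 30 minutes.

The problem: I have a table of changes: [strWas1, strWillBe1, strWas2, strWillBe2, ..., strWas10^7, strWillBe10^7]. I also have a big string that can contain strWasN as a substring, but it can also contain something-elsestrWas1, which I don't want to change, because "something-elsestrWas1" is not "strWas1".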

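For example, the String is:

"I have two dogs, three notdogs, also dogsikong, 5dogs, -dogs. DOGS, Dogs, DoGs, 33DoGs00"

Now I need to change every "dogs" that is not adjacent to a letter ("dogs" is strWas1) to "cats" ("cats" is strWillBe1). The result should be:

"I have two cats, three notdogs, also dogsikong, 5cats, -cats. cats, cats, cats, 33cats00"

My last attempt was: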

strWas1

But this runs endlessly.

New version (thanks to Vlad Feinstein):

array<String^>^ strArray = gcnew array<String^>(9999999);
strArray[0] = gcnew String("dogs");
strArray[1] = gcnew String("cats");
//...
strArray[9999998] = gcnew String("whatReplace");
strArray[9999999] = gcnew String("newText");
bool found = false;
int index;
bool doThis = true;
String ^ notAllowed = u8"aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźżAĄBCĆDEĘFGHIJKLŁMNŃOÓPQRSŚTUVWXYZŹŻёйцукенгшщзхъфывапролджэячсмитьбюЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ";
String ^ text = u8"I have two dogs, three notdogs, also dogsikong, 5dogs, -dogs. DOGS, Dogs, DoGs, 33DoGs00";
for (int i = 0; i < 9999999; i+=2) {
    while (found = text->Contains(strArray[i])) {
        index = text->IndexOf(strArray[i]);
        MessageBox::Show(index.ToString());
        doThis = true;
        if (index == 0) {
            for (int j = 0; j < notAllowed->Length; j++) {
                if (text->Substring(strArray[i]->Length, 1) == notAllowed->Substring(j, 1)) doThis = false;
            }
        }
        else if (text->Length - index - strArray[i]->Length) {
            for (int j = 0; j < notAllowed->Length; j++) {
                if (text->Substring(index-1, 1) == notAllowed->Substring(j, 1)) doThis = false;
            }
        }
        else {
            for (int j = 0; j < notAllowed->Length; j++) {
                if ((text->Substring(index - 1, 1) == notAllowed->Substring(j, 1)) || (text->Substring(index+strArray[i]->Length,1)== notAllowed->Substring(j, 1))) doThis = false;
            }
        }
        if (doThis) {
            text = text->Substring(0, index) + strArray[i + 1] + text->Substring(index + strArray[i]->Length, text->Length - index - strArray[i]->Length);
        }
    }
}

Of course, this version is still not that fast. The fast version was written by David Yaw.

Answer 1

There are several things in your code that could cause problems, but the main logic error is:

while (found = text->Contains(strArray[i]))

It should be

while (found == text->Contains(strArray[i]))

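because == is the comparison operator, while = is the assignment operator. You are therefore always assigning inside the loop condition, hence your endless while loop.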

Answer 2

Hmm... is it, though?

while (found == text->Contains(strArray[i]))

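is a comparison. But I have not computed found before that point. So I compute found inside the while condition and check whether it is true. That is allowed.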

while (found = text->Contains(strArray[i]))

is exactly the same as:

found = text->Contains(strArray[i]);
while (found==true)

At least in normal C++ this works. I have not run into any problems with it here either.
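
The point about assignment inside the loop condition can be checked in isolation. Below is a minimal sketch in standard C++ (not C++/CLI), with std::string::find standing in for Contains; the function name remove_all is made up for illustration. The assignment is re-evaluated before every iteration, so the loop ends as soon as nothing is left to find.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: erase every occurrence of `needle` from `text`
// and return how many occurrences were erased.
int remove_all(std::string& text, const std::string& needle)
{
    assert(!needle.empty()); // an empty needle would match forever

    bool found = false;
    int removed = 0;
    // Assignment, not comparison: `found` is recomputed before each
    // iteration, and the freshly assigned value is what `while` tests.
    while ((found = (text.find(needle) != std::string::npos)))
    {
        text.erase(text.find(needle), needle.size());
        ++removed;
    }
    return removed;
}
```

The loop terminates here only because each pass removes the match. If a pass leaves the text unchanged, as happens in the question's code whenever doThis ends up false, the condition stays true and the loop never exits.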

Answer 3

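There is a better way to do this than blindly checking each of the millions of replacement strings: let .NET hash the strings, and let it do the checking that way.

If we receive the find & replace strings as a dictionary, we can use .NET's hash lookup to find the strings that need replacing.

As we step through the input string character by character, each character could be the start of a 5-character 'search for' string, or the start of a 4-character 'search for' string, and so on, or it might not be part of a 'search for' string at all, in which case it is copied straight to the output. If we do find a 'search for' string, we write its replacement to the output and mark the corresponding number of input characters as consumed.

From your description, it looks like you need a case-insensitive comparison when searching for strings. You can use either case-sensitive or case-insensitive matching; just specify the one you want when constructing the Dictionary.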

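using namespace System;
using namespace System::Collections::Generic;
using namespace System::Diagnostics;
using namespace System::Text;

String^ BigFindReplace(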
    String^ originalString, 
    Dictionary<String^, String^>^ replacementPairs)
{
    // First, get the lengths of all the 'search for' strings in the replacement pairs.
    SortedSet<int> searchForLengths;
    for each (String^ searchFor in replacementPairs->Keys)
    {
        searchForLengths.Add(searchFor->Length);
    }

    // Searching for an empty string isn't valid: remove length zero, if it's there.
    searchForLengths.Remove(0);

    StringBuilder result;

    // Step through the input string. For each character:
    // A) See if the character is the beginning of one of the 'search for' strings.
    //    If so, then insert the 'replace with' string into the output buffer.
    //    Skip over this character and the rest of the 'search for' string that we found.
    // B) If it's not the beginning of a 'search for' string, copy it to the output buffer.

    for(int i = 0; i < originalString->Length; i++)
    {
        bool foundSomething = false;
        int foundSomethingLength = 0;
        for each (int len in searchForLengths.Reverse())
        {
            if (i > (originalString->Length - len))
            {
                // If we're on the last 4 characters of the string, we can ignore 
                // all the 'search for' strings that are 5 characters or longer.
                continue;
            }

            String^ substr = originalString->Substring(i, len);

            String^ replaceWith;
            if (replacementPairs->TryGetValue(substr, replaceWith))
            {
                // We found the section of the input string that we're looking at in our 
                // 'search for' list! Insert the 'replace with' into the output buffer.
                result.Append(replaceWith);
                foundSomething = true;
                foundSomethingLength = len;
                break; // don't try to find more 'search for' strings.
            }
        }

        if(foundSomething)
        {
            // We found & already inserted the replacement text. Just increment 
            // the loop counter to skip over the rest of the characters of the 
            // found 'search for' text.

            i += (foundSomethingLength - 1); // "-1" because the for loop has its own "+1".
        }
        else
        {
            // We didn't find any of the 'search for' strings, 
            // so this is a character that just gets copied.
            result.Append(originalString[i]);
        }
    }

    return result.ToString();
}

My test app:

int main(array<System::String ^> ^args)
{
    String^ text = "I have two dogs, three notdogs, also dogsikong, 5dogs, -dogs. DOGS, Dogs, DoGs, 33DoGs00";

    Dictionary<String^, String^>^ replacementPairs = 
        gcnew Dictionary<String^, String^>(StringComparer::CurrentCultureIgnoreCase);

    replacementPairs->Add("dogs", "cats");
    replacementPairs->Add("pigs", "cats");
    replacementPairs->Add("mice", "cats");
    replacementPairs->Add("rats", "cats");
    replacementPairs->Add("horses", "cats");

    String^ outText = BigFindReplace(text, replacementPairs);

    Debug::WriteLine(outText);

    String^ text2 = "I have two dogs, three notpigs, also miceikong, 5rats, -dogs. RATS, Horses, DoGs, 33DoGs00";
    String^ outText2 = BigFindReplace(text2, replacementPairs);

    Debug::WriteLine(outText2);

    return 0;
}

Output:

I have two cats, three notcats, also catsikong, 5cats, -cats. cats, cats, cats, 33cats00
I have two cats, three notcats, also catsikong, 5cats, -cats. cats, cats, cats, 33cats00

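Edit: whole words only

OK, so we need to replace whole words only. To do that, I wrote a helper method that splits a string into words & non-words. (This is different from the built-in String::Split method: String::Split does not return the delimiters, and we need them here.)

Once we have an array of strings in which each string is either a word or a run of non-word characters (separators, whitespace, etc.), we can run each string through the dictionary. Because we are now processing one word at a time rather than one character at a time, this is much more efficient.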

array<String^>^ SplitIntoWords(String^ input)
{
    List<String^> result;
    StringBuilder currentWord;
    bool currentIsWord = false;

    for each (System::Char c in input)
    {
        // Words are made up of letters. Word separators are made up of 
        // everything else (numbers, whitespace, punctuation, etc.)
        bool nextCharIsWord = Char::IsLetter(c);

        if(nextCharIsWord != currentIsWord)
        {
            if(currentWord.Length > 0)
            {
                result.Add(currentWord.ToString());
                currentWord.Clear();
            }
            currentIsWord = nextCharIsWord;
        }

        currentWord.Append(c);
    }

    if(currentWord.Length > 0)
    {
        result.Add(currentWord.ToString());
        currentWord.Clear();
    }

    return result.ToArray();
}

String^ BigFindReplaceWords(
    String^ originalString, 
    Dictionary<String^, String^>^ replacementPairs)
{
    StringBuilder result;

    // First, separate the input string into an array of words & non-words.
    array<String^>^ asWords = SplitIntoWords(originalString);

    // Go through each word & non-word that came out of the split. If a word or 
    // non-word is in the replacement list, add the replacement to the output. 
    // Otherwise, add the word/nonword to the output.

    for each (String^ word in asWords)
    {
        String^ replaceWith;
        if (replacementPairs->TryGetValue(word, replaceWith))
        {
            result.Append(replaceWith);
        }
        else
        {
            result.Append(word);
        }
    }

    return result.ToString();
}

My test app:

int main(array<System::String ^> ^args)
{
    String^ text = "I have two dogs, three notdogs, also dogsikong, 5dogs, -dogs. DOGS, Dogs, DoGs, 33DoGs00";

    array<String^>^ words = SplitIntoWords(text);
    for (int i = 0; i < words->Length; i++)
    {
        Debug::WriteLine("words[{0}] = '{1}'", i, words[i]);
    }

    Dictionary<String^, String^>^ replacementPairs = 
        gcnew Dictionary<String^, String^>(StringComparer::CurrentCultureIgnoreCase);

    replacementPairs->Add("dogs", "cats");
    replacementPairs->Add("pigs", "cats");
    replacementPairs->Add("mice", "cats");
    replacementPairs->Add("rats", "cats");
    replacementPairs->Add("horses", "cats");

    String^ outText = BigFindReplaceWords(text, replacementPairs);

    Debug::WriteLine(outText);

    String^ text2 = "I have two dogs, three notpigs, also miceikong, 5rats, -dogs. RATS, Horses, DoGs, 33DoGs00";
    String^ outText2 = BigFindReplaceWords(text2, replacementPairs);

    Debug::WriteLine(outText2);

    return 0;
}

Result:

words[0] = 'I'
words[1] = ' '
words[2] = 'have'
words[3] = ' '
words[4] = 'two'
words[5] = ' '
words[6] = 'dogs'
words[7] = ', '
words[8] = 'three'
words[9] = ' '
words[10] = 'notdogs'
words[11] = ', '
words[12] = 'also'
words[13] = ' '
words[14] = 'dogsikong'
words[15] = ', 5'
words[16] = 'dogs'
words[17] = ', -'
words[18] = 'dogs'
words[19] = '. '
words[20] = 'DOGS'
words[21] = ', '
words[22] = 'Dogs'
words[23] = ', '
words[24] = 'DoGs'
words[25] = ', 33'
words[26] = 'DoGs'
words[27] = '00'
I have two cats, three notdogs, also dogsikong, 5cats, -cats. cats, cats, cats, 33cats00
I have two cats, three notpigs, also miceikong, 5cats, -cats. cats, cats, cats, 33cats00
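
For readers working outside .NET, the same whole-word idea can be sketched in standard C++. This is an illustrative port under stated assumptions, not the answer's code: the names replaceWords, isLetterAscii and toLowerAscii are made up, letters are classified with ASCII-only std::isalpha instead of Char::IsLetter, the dictionary keys are assumed to be stored in lower case, and only letter runs are looked up (the version above also looks up the non-word runs).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Sketch only: ASCII letter test standing in for Char::IsLetter.
static bool isLetterAscii(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Sketch only: ASCII lower-casing used to get case-insensitive lookups.
static std::string toLowerAscii(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Assumption: every key in `pairs` is already lower case.
std::string replaceWords(const std::string& text,
                         const std::unordered_map<std::string, std::string>& pairs)
{
    std::string result;
    result.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size())
    {
        // Find the end of the current run of letters (or of non-letters).
        const bool runIsWord = isLetterAscii(text[i]);
        std::size_t j = i;
        while (j < text.size() && isLetterAscii(text[j]) == runIsWord)
            ++j;

        const std::string run = text.substr(i, j - i);
        // One hash lookup per word; separators are never looked up.
        auto it = runIsWord ? pairs.find(toLowerAscii(run)) : pairs.end();
        if (it != pairs.end())
            result += it->second; // whole word is in the table: replace it
        else
            result += run;        // otherwise copy the run unchanged
        i = j;
    }
    return result;
}
```

The cost is one pass over the text plus one hash lookup per word, so it does not grow with the number of replacement pairs.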

Answer 4

ПётрВасильевич, some suggestions:

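Replace Substring(x, 1) with Chars[x] (in C++/CLI, text[x]).
Drop the notAllowed string and use .NET's Char.IsLetter method, or at least break out of your for() loop as soon as you set doThis = false;
If you need the substring from index to the end of the string, you don't have to compute the length; just use the one-argument form: public string Substring(int startIndex)
Don't use text->Contains(); you need to call text->IndexOf() anyway, so just compare that index with -1.
10 million words??? English and Russian combined don't have that many!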

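Use the two-argument form, String.IndexOf Method (String, Int32), to specify where to start searching (from the position of the previously found word), so that you don't search the beginning of the string over and over. Something like this: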

for (int i = 0; i < 9999999; i += 2) 
{
    int index = 0;
    while ((index = text->IndexOf(strArray[i], index)) != -1)
    {
        doThis = true;
        // is there one more char after the match?
        if (index + strArray[i]->Length < text->Length) 
        {
            if (Char::IsLetter(text[index + strArray[i]->Length]))
                doThis = false;
        }
        // is there a previous char?
        if (index > 0)
        {
            if (Char::IsLetter(text[index - 1]))
                doThis = false;
        }
        if (doThis)
        {
            text = text->Substring(0, index) + strArray[i + 1] +
                   text->Substring(index + strArray[i]->Length);
            // resume the search right after the inserted replacement
            index += strArray[i + 1]->Length;
        }
        else
        {
            // not a whole word: step forward so the same match is not found again
            index++;
        }
    }
}

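In the while() loop, collect the indexes of the found string into an array, and then do all the replacements of that word in a single pass. This is especially useful when the same word occurs many times in text.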
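
The collect-then-replace idea can be sketched as follows. This is an illustration in standard C++ rather than C++/CLI, under stated assumptions: replace_whole_word and is_letter are hypothetical names, matching is case-sensitive, and letters are detected with ASCII-only std::isalpha instead of Char::IsLetter.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: ASCII letter test standing in for Char::IsLetter.
static bool is_letter(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Replace every whole-word occurrence of `was` with `willBe`.
std::string replace_whole_word(const std::string& text,
                               const std::string& was,
                               const std::string& willBe)
{
    assert(!was.empty()); // searching for an empty string is not meaningful

    // Pass 1: collect the start index of every whole-word match.
    std::vector<std::size_t> hits;
    std::size_t pos = 0;
    while ((pos = text.find(was, pos)) != std::string::npos)
    {
        const std::size_t end = pos + was.size();
        const bool letterBefore = pos > 0 && is_letter(text[pos - 1]);
        const bool letterAfter = end < text.size() && is_letter(text[end]);
        if (!letterBefore && !letterAfter)
        {
            hits.push_back(pos);
            pos = end; // accepted: continue after the match
        }
        else
        {
            ++pos;     // rejected: move on by one character
        }
    }

    // Pass 2: build the result once, copying the text between matches.
    std::string result;
    result.reserve(text.size());
    std::size_t copiedUpTo = 0;
    for (std::size_t hit : hits)
    {
        result.append(text, copiedUpTo, hit - copiedUpTo);
        result += willBe;
        copiedUpTo = hit + was.size();
    }
    result.append(text, copiedUpTo, std::string::npos);
    return result;
}
```

Building the output once avoids re-allocating the whole string for every single match, which is where repeated Substring concatenation spends most of its time on a large text.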

How to make many small changes in a big string the fastest way. Visual C++

4 answers:
