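C++ optimization for string comparison and replacement

Time: 2015-05-06 19:43:58

Tags: c++ string optimization vector char

I'm trying to optimize a function I wrote that compares two strings and then replaces characters in the first string if they are not found in the second. Do you think that, for example, converting the strings to vectors of chars when changing to upper case would give a performance gain? Beyond that, I can't see many ways around the two for loops. Any general tips would be much appreciated!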

void optimize(std::string & toBeProcessed, const std::string & toBeIgnored, char ch)
{
    std::string upperProcessed = toBeProcessed;
    std::transform(upperProcessed.begin(), upperProcessed.end(), upperProcessed.begin(), ::toupper);
    std::string upperIgnored = toBeIgnored;
    std::transform(upperIgnored.begin(), upperIgnored.end(), upperIgnored.begin(), ::toupper);
    std::vector<char> vectorAfterProcessed;
    bool found;
    for(int i = 0; i <= upperProcessed.size(); i++)
    {
        found = false;
        for(int j = 0; j <= upperIgnored.size(); j++)
        {
            if(upperProcessed[i] == upperIgnored[j])
            {
                vectorAfterProcessed.push_back(upperProcessed[i]);
                found = true;
            }
        }
        if(found != true)
        {
            vectorAfterProcessed.push_back(ch);
        }
    }
    std::string test(vectorAfterProcessed.begin(), vectorAfterProcessed.end());
}

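3 Answers:

Answer 0 (score: 3)

Note that a char can hold only 256 distinct values. You only need to scan the ignored string once and fill a bitmask of the characters that occur in it: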

uint32_t bitmask[8] = {0};  // 8 x 32 bits: one bit per possible char value
for(size_t j = 0; j < upperIgnored.size(); j++)
{
    uint8_t chr = static_cast<uint8_t>(upperIgnored[j]);
    bitmask[chr >> 5] |= (1u << (chr & 31));  // 1u keeps the shift in unsigned arithmetic
}

After that, instead of running the inner loop, simply check the bitmask:

for(size_t i = 0; i < upperProcessed.size(); i++)
{
    uint8_t chr = static_cast<uint8_t>(upperProcessed[i]);
    if(bitmask[chr >> 5] & (1u << (chr & 31)))  // bit set: character occurs in the ignored string
    {
        vectorAfterProcessed.push_back(upperProcessed[i]);
    }
    else
    {
        vectorAfterProcessed.push_back(ch);
    }
}
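Put together, the two loops above could look like the following self-contained function. This is only a sketch: the name `replace_with_bitmask` and its parameter names are made up for illustration, it upper-cases on the fly instead of copying the strings first, and it assumes 8-bit chars.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): upper-cases `input` and
// replaces every character that does not occur in `keep` (compared
// case-insensitively) with `rep`.
std::string replace_with_bitmask(std::string input, const std::string& keep, char rep)
{
    // 8 words x 32 bits = 256 bits, one per possible unsigned char value
    // (assumes 8-bit chars).
    std::uint32_t bitmask[8] = {0};

    // Pass 1: mark the upper-cased form of every character in `keep`.
    for (std::size_t j = 0; j < keep.size(); ++j)
    {
        const unsigned char up = static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(keep[j])));
        bitmask[up >> 5] |= (std::uint32_t(1) << (up & 31));
    }

    // Pass 2: one bit test per input character replaces the inner loop.
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const unsigned char up = static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(input[i])));
        const bool kept = ((bitmask[up >> 5] >> (up & 31)) & 1u) != 0;
        input[i] = kept ? static_cast<char>(up) : rep;
    }
    return input;
}
```

Called as `replace_with_bitmask("abc123", "abc1", 'x')`, the sketch keeps the upper-cased `A`, `B`, `C` and `1` and replaces the remaining two characters with `x`.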

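Also note that your code has two other problems: your loops include the right end (`<=` instead of `<`), which goes one past the last character and can lead to a segfault/access violation, and you have no break after setting found to true, which causes a character to be appended multiple times if it occurs multiple times in the ignored string.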
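For reference, the nested-loop version with just those two problems fixed might look like this. It is a minimal sketch, not the asker's final code: the name `optimize_fixed` is hypothetical, and it returns the result and builds a `std::string` directly instead of going through a `std::vector<char>`.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical corrected version of the question's function. Still O(N * M),
// but it stays in bounds and appends each character exactly once.
std::string optimize_fixed(const std::string& toBeProcessed,
                           const std::string& toBeIgnored, char ch)
{
    // toupper needs a value representable as unsigned char.
    auto to_upper = [](unsigned char c) { return static_cast<char>(std::toupper(c)); };

    std::string upperProcessed = toBeProcessed;
    std::transform(upperProcessed.begin(), upperProcessed.end(), upperProcessed.begin(), to_upper);
    std::string upperIgnored = toBeIgnored;
    std::transform(upperIgnored.begin(), upperIgnored.end(), upperIgnored.begin(), to_upper);

    std::string result;
    result.reserve(upperProcessed.size());
    for (std::size_t i = 0; i < upperProcessed.size(); ++i)      // '<', not '<='
    {
        bool found = false;
        for (std::size_t j = 0; j < upperIgnored.size(); ++j)    // '<', not '<='
        {
            if (upperProcessed[i] == upperIgnored[j])
            {
                found = true;
                break;   // stop at the first match so nothing is appended twice
            }
        }
        result.push_back(found ? upperProcessed[i] : ch);
    }
    return result;
}
```

With the break in place, an ignored string that contains duplicates (for example `"aa"`) no longer duplicates characters in the output.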

Answer 1 (score: 1)

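I was in the middle of writing a solution similar to Ishmael's, except that I was tempted to use a 256-byte boolean array rather than Ishmael's more cache-friendly bitmask array (8 x 32 bits, i.e. 32 bytes).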

So I was very curious how these would perform against each other and whipped up a quick benchmark.

Benchmark

#include <string>
#include <algorithm>
#include <iostream>
#include <cassert>
#include <vector>
#include <ctime>
#include <cctype>
#include <cstdint>   // uint8_t, uint32_t
#include <cstdlib>   // rand

using namespace std;

static string optimize_original(string& toBeProcessed, const string& toBeIgnored, char ch)
{
    string upperProcessed = toBeProcessed;
    transform(upperProcessed.begin(), upperProcessed.end(), upperProcessed.begin(), ::toupper);
    string upperIgnored = toBeIgnored;
    transform(upperIgnored.begin(), upperIgnored.end(), upperIgnored.begin(), ::toupper);
    vector<char> vectorAfterProcessed;
    bool found;
    for(size_t i = 0; i <= upperProcessed.size(); i++)
    {
        found = false;
        for(size_t j = 0; j <= upperIgnored.size(); j++)
        {
            if(upperProcessed[i] == upperIgnored[j])
            {
                vectorAfterProcessed.push_back(upperProcessed[i]);
                found = true;
            }
        }
        if(found != true)
            vectorAfterProcessed.push_back(ch);
    }
    return string(vectorAfterProcessed.begin(), vectorAfterProcessed.end());
}

static string optimize_paul(string toBeProcessed, string toBeIgnored, char ch)
{
    transform(toBeProcessed.begin(), toBeProcessed.end(), toBeProcessed.begin(), ::toupper);
    transform(toBeIgnored.begin(), toBeIgnored.end(), toBeIgnored.begin(), ::toupper);
    string test;
    size_t start = 0;
    while (start < toBeProcessed.size())
    {
        size_t n = toBeProcessed.find_first_not_of(toBeIgnored, start);
        if ( n != string::npos)
        { 
            toBeProcessed[n] = ch;
            start = n+1;
        }
        else
            break;
    }
    return toBeProcessed;
}

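static string optimize_ike(string input, const string& to_keep, char rep)
{
    // One flag per possible unsigned char value.
    bool used[256] = {false};
    for (size_t j=0; j < to_keep.size(); ++j)
    {
        // Convert to unsigned char first: a negative char is invalid both
        // as an argument to tolower/toupper and as an array index.
        const unsigned char c = static_cast<unsigned char>(to_keep[j]);
        used[static_cast<unsigned char>(tolower(c))] = true;
        used[static_cast<unsigned char>(toupper(c))] = true;
    }
    for (size_t j=0; j < input.size(); ++j)
    {
        const unsigned char c = static_cast<unsigned char>(input[j]);
        if (used[c])
            input[j] = static_cast<char>(toupper(c));
        else
            input[j] = rep;
    }
    return input;
}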

static string optimize_ishmael(string input, const string& to_keep, char rep)
{
    // 8 x 32 bits: one bit per possible unsigned char value.
    uint32_t bitmask[8] = {0};
    for (size_t j=0; j < to_keep.size(); ++j)
    {
        // tolower/toupper require a value representable as unsigned char.
        const unsigned char c = static_cast<unsigned char>(to_keep[j]);

        const uint8_t lower = static_cast<uint8_t>(tolower(c));
        bitmask[lower >> 5] |= (1u << (lower & 31));

        const uint8_t upper = static_cast<uint8_t>(toupper(c));
        bitmask[upper >> 5] |= (1u << (upper & 31));
    }
    for (size_t j=0; j < input.size(); ++j)
    {
        const uint8_t chr = static_cast<uint8_t>(input[j]);
        if (bitmask[chr >> 5] & (1u << (chr & 31)))
            input[j] = static_cast<char>(toupper(chr));
        else
            input[j] = rep;
    }
    return input;
}

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

enum {string_len = 10000000};
enum {num_tests = 5};

int main()
{
    const string to_keep = "abcd";
    for (int k=0; k < num_tests; ++k)
    {
        string in;
        for (int j=0; j < string_len; ++j)
            in += rand() % 26 + 'A';

        double time = sys_time();
        volatile const string a = optimize_original(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for original" << endl;

        time = sys_time();
        volatile const string b = optimize_paul(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Paul's" << endl;

        time = sys_time();
        volatile const string c = optimize_ike(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Ike's" << endl;

        time = sys_time();
        volatile const string d = optimize_ishmael(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Ishmael's" << endl;

        cout << endl;
    }
}

Results

515 ms for original
218 ms for Paul's
78 ms for Ike's
63 ms for Ishmael's

514 ms for original
203 ms for Paul's
78 ms for Ike's
73 ms for Ishmael's

515 ms for original
218 ms for Paul's
78 ms for Ike's
63 ms for Ishmael's

515 ms for original
202 ms for Paul's
67 ms for Ike's
62 ms for Ishmael's

515 ms for original
218 ms for Paul's
78 ms for Ike's
62 ms for Ishmael's

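Winner: Ishmael

When it comes to speed, the winner appears to be Ishmael's solution: it is not only the theoretically fastest at O(N + M) [the original is O(N * M)], but also the most micro-efficient.

I believe his solution is clearly superior to mine. I just wanted to provide a benchmark comparing all of these for posterity.

Paul's solution is probably the most elegant in terms of modern C++, using what the standard library offers to replace the inner loop with higher-level logic. Speed isn't always (or even usually) everything.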
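To illustrate the "higher-level logic" point, here is one more way to let the standard library hide both loops. Note that this is a swapped-in variant using `std::replace_if` and `std::string::find`, not Paul's code, and the name `replace_if_sketch` is made up. It was not part of the benchmark above, so no claim is made about its speed.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: upper-cases `input`, then replaces every character
// that does not occur in the upper-cased `keep` with `rep`.
std::string replace_if_sketch(std::string input, std::string keep, char rep)
{
    // toupper needs a value representable as unsigned char.
    auto to_upper = [](unsigned char c) { return static_cast<char>(std::toupper(c)); };
    std::transform(input.begin(), input.end(), input.begin(), to_upper);
    std::transform(keep.begin(), keep.end(), keep.begin(), to_upper);

    // std::replace_if plays the role of the outer loop;
    // std::string::find plays the role of the inner one.
    std::replace_if(input.begin(), input.end(),
                    [&keep](char c) { return keep.find(c) == std::string::npos; },
                    rep);
    return input;
}
```

Like the nested loops, this still searches `keep` once per input character, so it trades speed for brevity.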

Answer 2 (score: 0)

Note: to find out whether this is faster, you have to profile the code. Also, the program below has not been tested against all input cases.

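If your goal is to search for characters that are not in a string, then instead of nested search loops (and a vector), using find_first_not_of in a loop will (or at least should) do the job.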

#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string optimize(std::string toBeProcessed, std::string toBeIgnored, char ch)
{
    std::transform(toBeProcessed.begin(), toBeProcessed.end(), toBeProcessed.begin(), ::toupper);
    std::transform(toBeIgnored.begin(), toBeIgnored.end(), toBeIgnored.begin(), ::toupper);
    std::string test;
    size_t start = 0;
    while (start < toBeProcessed.size())
    {
        size_t n = toBeProcessed.find_first_not_of(toBeIgnored, start);
        if ( n != std::string::npos)
        { 
            toBeProcessed[n] = ch;
            start = n+1;
        }
        else
            break;
    }
    return toBeProcessed;
}

int main()
{
    std::string out = optimize("abc123", "abc1", 'x');
    std::cout << out;
}

Live example: http://ideone.com/RsB37f

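This has not been tested against all inputs, but it illustrates the basic point. I am replacing characters in the string in place, rather than creating a vector (or even a new string) and building the string from scratch.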
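Since the function has not been tested on every input, a few hand-picked edge cases are cheap to check. The sketch below repeats the function under the hypothetical name `optimize_checked` (with a lambda around `toupper` so that negative char values are handled) and adds a hypothetical `edge_cases_pass` helper; the cases chosen are illustrative, not exhaustive.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Same logic as the answer's optimize(), renamed to avoid clashing with it.
std::string optimize_checked(std::string toBeProcessed, std::string toBeIgnored, char ch)
{
    auto to_upper = [](unsigned char c) { return static_cast<char>(std::toupper(c)); };
    std::transform(toBeProcessed.begin(), toBeProcessed.end(), toBeProcessed.begin(), to_upper);
    std::transform(toBeIgnored.begin(), toBeIgnored.end(), toBeIgnored.begin(), to_upper);

    std::size_t start = 0;
    while (start < toBeProcessed.size())
    {
        // Jump straight to the next character that is NOT in toBeIgnored.
        const std::size_t n = toBeProcessed.find_first_not_of(toBeIgnored, start);
        if (n == std::string::npos)
            break;
        toBeProcessed[n] = ch;
        start = n + 1;
    }
    return toBeProcessed;
}

// Returns true when every hand-picked case gives the expected string.
bool edge_cases_pass()
{
    return optimize_checked("abc123", "abc1", 'x') == "ABC1xx"   // the answer's example
        && optimize_checked("", "abc", 'x').empty()              // empty input
        && optimize_checked("abc", "", 'x') == "xxx"             // nothing is kept
        && optimize_checked("aAbB", "a", '*') == "AA**";         // case-insensitive match
}
```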

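Also, I pass the parameters by value, because doing so is advantageous with a C++11-compliant compiler. Even if you don't have a C++11 compiler, you lose nothing by this, since in your original example you were copying the passed-in strings into local variables anyway.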
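The pass-by-value point can be seen in a small sketch. The function `stars_for_spaces` is hypothetical and unrelated to the code above; it only shows where a by-value string parameter is copied and where it can be moved under C++11.

```cpp
#include <string>
#include <utility>

// Hypothetical function that needs its own modifiable copy of the argument.
// Taking the string by value lets the caller decide how that copy is made.
std::string stars_for_spaces(std::string text)
{
    for (char& c : text)
    {
        if (c == ' ')
            c = '*';
    }
    return text;   // a by-value parameter is moved out on return, not copied
}

// How the parameter is initialised depends on the argument (C++11 and later):
//   std::string s = "a b";
//   stars_for_spaces(s);                  // lvalue: one copy, s stays intact
//   stars_for_spaces(std::move(s));       // rvalue: the buffer is moved, no copy
//   stars_for_spaces(std::string("a b")); // temporary: moved or elided
```

A `const std::string&` parameter followed by a local copy, as in the question, always pays for the copy, even when the caller passes a temporary.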