Question

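I just can't see what I'm doing wrong. The Unicode tokenizer function shown below is very slow. Can anyone give me a hint on how to speed it up? Thanks for your help. By the way, ustring is Glib::ustring. sep1 holds the separators that should not appear in the result; sep2 holds the separators that should each appear in the result as a single token.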

void tokenize(const ustring & u, const ustring & sep1, 
        const ustring & sep2, vector<ustring> & tokens) {
    ustring s;
    s.reserve(100);
    ostringstream os;
    gunichar c;
    for (int i = 0; i < u.length(); i++) {
        c = u[i];
        if (sep1.find(c) != ustring::npos) {
            tokens.push_back(s);
            s = "";
        }
        else if (sep2.find(c) != ustring::npos) {
            tokens.push_back(s);
            s = "";
            s.append(1, c);
            tokens.push_back(s);
            s = "";
        }
        else {
            s.append(1, c);
        }
    }
    if (s != "")
        tokens.push_back(s);
}

I have now changed it to the following (it now takes between 1 and 2 seconds):

ustring s;
s.reserve(100);
ostringstream os;
gunichar c;

set<gunichar> set_sep1;
int i=0;

for (i=0;i<sep1.size();i++)
{
    set_sep1.insert(sep1[i]);
}

set<gunichar> set_sep2;
for (i=0;i<sep2.size();i++)
{
    set_sep2.insert(sep2[i]);
}

int start_index=0;  // start of the current token; 0 so a token at the very beginning is not dropped
int ulen=u.length();
i=0;
for (ustring::const_iterator it=u.begin();it!=u.end();++it)
{
    c=*it;
    if (set_sep1.find(c)!=set_sep1.end())
    {
        if (start_index!=-1 && start_index<i)
            tokens.push_back(u.substr(start_index,i-start_index));
        start_index=i+1;
        s="";
    }
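    else if (set_sep2.find(c)!=set_sep2.end())
    {
        // flush the pending token, then emit the separator itself as a token
        if (start_index!=-1 && start_index<i)
            tokens.push_back(u.substr(start_index,i-start_index));
        s="";
        s.append(1,c);
        tokens.push_back(s);
        start_index=i+1;
        s="";
    }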
    i++;
}
if (start_index!=-1 && start_index<ulen)
    tokens.push_back(u.substr(start_index,ulen-start_index));

Answer 1

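The things that are likely to be "very slow" here are listed below. Glib::ustring stores its contents as UTF-8, so counting or indexing characters means scanning the bytes rather than jumping straight to a position.

ustring::length()
ustring::append()
Random access into the ustring through operator[], e.g. c = u[i];

Try the following:

Instead of calling u.length() in the loop condition, store the length in a variable and compare against that variable in the loop.
Append the characters of the current token to an ostringstream or wostringstream rather than to a ustring.
Traverse the ustring with an iterator rather than with an index, which involves random access.

Example: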

for(ustring::const_iterator it = u.cbegin(); it != u.cend(); it++)
{
    c = *it;
    //implementation follows
}
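
To make the iterator advice concrete, here is a minimal sketch of a single-pass tokenizer that never indexes into the string. It is not the poster's code: the name tokenize_iter is hypothetical, the separator lookup uses std::unordered_set instead of searching sep1/sep2 for every character, and it is only exercised with std::u32string as a stand-in. It assumes the string type offers begin()/end(), value_type and an iterator-range constructor, which Glib::ustring is expected to provide, but that combination is untested here. Unlike the original function, it skips empty tokens between adjacent separators.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the original post): single-pass tokenizer that
// relies only on forward iteration, so it never pays for index-based access.
// Assumes String has begin()/end(), value_type and an iterator-range constructor.
template <class String>
void tokenize_iter(const String& u, const String& sep1, const String& sep2,
                   std::vector<String>& tokens) {
    using Char = typename String::value_type;
    // Build the lookup sets once, up front.
    const std::unordered_set<Char> drop(sep1.begin(), sep1.end());  // removed from output
    const std::unordered_set<Char> keep(sep2.begin(), sep2.end());  // emitted as tokens

    auto token_start = u.begin();
    for (auto it = u.begin(); it != u.end(); ++it) {
        const Char c = *it;
        const bool is_drop = drop.count(c) != 0;
        const bool is_keep = !is_drop && keep.count(c) != 0;
        if (!is_drop && !is_keep) continue;  // ordinary character, keep scanning

        // Copy the pending token in one go (empty tokens are skipped).
        if (token_start != it) tokens.emplace_back(token_start, it);
        if (is_keep) {
            auto next = it;
            ++next;
            tokens.emplace_back(it, next);  // the separator itself as a token
        }
        token_start = it;
        ++token_start;
    }
    // Trailing token after the last separator, if any.
    if (token_start != u.end()) tokens.emplace_back(token_start, u.end());
}
```

Called with u = U"ab, cd;e", sep1 = U" " and sep2 = U",;", this fills tokens with "ab", ",", "cd", ";", "e".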

Answer 2

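I think the following would speed up your code considerably, though it would be easy enough to measure. Currently you are:

Iterating over every character.
Performing a lookup in sep1 to see whether that character is a separator.
Appending one character at a time as needed.

Assuming your list of separators is smaller than the string you are parsing, wouldn't it be better to do the following instead:

For each separator, perform a find to see whether the separator occurs in the string.
If it is found, append the entire substring at once and repeat the find on the remaining substring.

A second optimization is to order the separators by how likely they are to match. If, for example, "," is the most common separator, make sure it is searched for first. If one separator is much hotter than the others, this will make a big difference.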
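
A rough sketch of that idea, under some assumptions: it uses std::u32string as a stand-in for Glib::ustring, the function name tokenize_find is made up for illustration, and it swaps the per-separator find for a single find_first_of over all separators, which jumps to whichever separator comes next. With Glib::ustring, position-based find and substr calls may themselves be costly, so treat this as an illustration of the substring-at-once approach rather than a drop-in replacement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch on std::u32string: jump from separator to separator and
// copy each token as one substring instead of appending character by character.
std::vector<std::u32string> tokenize_find(const std::u32string& u,
                                          const std::u32string& sep1,
                                          const std::u32string& sep2) {
    const std::u32string all_seps = sep1 + sep2;
    std::vector<std::u32string> tokens;
    std::size_t start = 0;
    while (start < u.size()) {
        const std::size_t pos = u.find_first_of(all_seps, start);
        if (pos == std::u32string::npos) {
            tokens.push_back(u.substr(start));  // no separator left: rest is one token
            break;
        }
        // Whole token in one copy (empty tokens are skipped).
        if (pos > start) tokens.push_back(u.substr(start, pos - start));
        // sep1 takes precedence, as in the original; anything else found here is in sep2.
        if (sep1.find(u[pos]) == std::u32string::npos)
            tokens.push_back(u.substr(pos, 1));
        start = pos + 1;
    }
    return tokens;
}
```

For example, tokenize_find(U"ab, cd;e", U" ", U",;") yields "ab", ",", "cd", ";", "e".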
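
One possible way to get such an ordering is to count how often each separator occurs in a representative sample and sort by that count. The helper below is purely illustrative (order_by_frequency and its parameters are assumptions, and std::u32string again stands in for Glib::ustring); whether it pays off depends on how the separators are searched afterwards.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: return the separators reordered so that the ones seen
// most often in `sample` come first. Ties keep their original relative order.
std::u32string order_by_frequency(const std::u32string& seps,
                                  const std::u32string& sample) {
    std::unordered_map<char32_t, std::size_t> counts;
    for (char32_t s : seps) counts[s] = 0;
    for (char32_t c : sample) {
        auto it = counts.find(c);
        if (it != counts.end()) ++it->second;  // only separators are counted
    }
    std::u32string ordered = seps;
    std::stable_sort(ordered.begin(), ordered.end(),
                     [&counts](char32_t a, char32_t b) {
                         return counts.at(a) > counts.at(b);
                     });
    return ordered;
}
```

With seps = U";, " and sample = U"a,b,c d;e,f", the comma occurs three times and the other two separators once each, so the result is U",; ".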

C++ - Tokenizer is very slow
