I have a set of target strings and their replacements:
" </3" "\xf0\x9f\x92\x94"
" <3 " "\xf0\x9f\x92\x97"
" 8-D" "\xf0\x9f\x98\x81"
" 8D " "\xf0\x9f\x98\x81"
" x-D" "\xf0\x9f\x98\x81"
" xD " "\xf0\x9f\x98\x81"
" :')" "\xf0\x9f\x98\x82"
":'-)" "\xf0\x9f\x98\x82"
":-))" "\xf0\x9f\x98\x83"
" 8) " "\xf0\x9f\x98\x84"
" :) " "\xf0\x9f\x98\x84"
" :-)" "\xf0\x9f\x98\x84"
" =) " "\xf0\x9f\x98\x84"
" =] " "\xf0\x9f\x98\x84"
" 0:)" "\xf0\x9f\x98\x87"
"0:-)" "\xf0\x9f\x98\x87"
...
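These are emoticons and the hexadecimal (UTF-8 byte) representation of the corresponding emoji Unicode characters. I padded the emoticons with spaces so that both the emoticon and its replacement emoji string are 4 bytes long.
I want to replace the emoticons in an input file with the corresponding emoji strings. What is the most efficient way to do this?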
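I thought of two approaches:
1. Scan the input and, whenever a 3, D, ), ] or any other character that terminates an emoticon string arrives, check the last three characters to see whether they form a valid emoticon, then replace them with the corresponding emoji string.
2. Build a regular expression for each emoticon (using <regex.h>), apply all of the regexes to the full text one after another, and perform the replacements.
The second approach sounds very slow, but the code should be easy. Something like this (pseudocode, please excuse any syntax errors):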
struct emoticon_replacement {
    regex_t *regex;
    char *targ, *repl;
};
struct emoticon_replacement replacements[] = {
    {NULL, ...., ....},
    {NULL, ...., ....},
    {NULL, ...., ....},
    ....
};
// followed by regex initialization, taking advantage
// of sizeof(replacements)
// And again take advantage of sizeof(replacements) to loop
// over the regexes and replace occurrences
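The first approach should be faster, if it is feasible, perhaps with something like std::map. How can I implement these approaches efficiently? Are there other options I should consider?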
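For reference, the first approach could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the table is shortened to three entries, the names emotes, terminators and replace_on_terminator are made up for illustration, and because the targets are padded to 4 bytes it checks the last four bytes (not three) whenever a possible terminating character is seen.

```c
#include <string.h>

/* Hypothetical, shortened table: targ and repl are both exactly 4 bytes. */
struct emote { const char *targ; const char *repl; };

static const struct emote emotes[] = {
    { " <3 ", "\xf0\x9f\x92\x97" },
    { " :) ", "\xf0\x9f\x98\x84" },
    { ":-))", "\xf0\x9f\x98\x83" },
};

/* Characters that may end a padded 4-byte target (assumed set). */
static const char terminators[] = " 3D)]";

/* Replaces emoticons in place; the length of s never changes. */
void replace_on_terminator(char *s) {
    size_t len = strlen(s);
    for (size_t i = 3; i < len; i++) {
        if (!strchr(terminators, s[i]))
            continue;                      /* cannot be the end of a target */
        char *window = s + i - 3;          /* the last four bytes seen */
        for (size_t k = 0; k < sizeof emotes / sizeof *emotes; k++) {
            if (memcmp(window, emotes[k].targ, 4) == 0) {
                memcpy(window, emotes[k].repl, 4);
                break;
            }
        }
    }
}
```

The inner loop is a linear search, so the cost per terminator grows with the size of the table; that is the part a map or a sorted table would speed up.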
Full disclosure: this is part of a test.
Answer 0 (score: 1)
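If there is a chance that the data will change, you may want to use some kind of map (for example a hash table or a trie). It doesn't seem necessary to cover the theory behind those here, especially since the question doesn't mention that option... I just wanted to mention it as something to think about.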
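To make the map option a little more concrete, here is a minimal sketch of a fixed-size hash table with linear probing. It relies on the keys being exactly 4 bytes, so each key is packed into a uint32_t. The names slot, SLOTS, emote_put and emote_get are hypothetical, the slot count and hash multiplier are arbitrary choices, and deletion is not handled (it would need tombstones or re-insertion).

```c
#include <stdint.h>
#include <string.h>

#define SLOTS 64   /* power of two, assumed comfortably larger than the entry count */

struct slot {
    uint32_t key;      /* the 4-byte emoticon packed into an integer */
    char repl[4];      /* the 4-byte UTF-8 replacement */
    int used;
};

static uint32_t pack4(const char *p) {
    uint32_t k;
    memcpy(&k, p, 4);  /* byte order is irrelevant as long as it is consistent */
    return k;
}

/* Multiplicative hash; keeps the top 6 bits, giving an index in 0..63. */
static size_t hash32(uint32_t k) {
    return (uint32_t)(k * 2654435761u) >> 26;
}

/* Inserts or updates an entry. Returns 1 on success, 0 if the table is full. */
int emote_put(struct slot *t, const char *orig, const char *repl) {
    uint32_t k = pack4(orig);
    size_t i = hash32(k);
    for (size_t n = 0; n < SLOTS; n++, i = (i + 1) % SLOTS) {
        if (!t[i].used || t[i].key == k) {
            t[i].used = 1;
            t[i].key = k;
            memcpy(t[i].repl, repl, 4);
            return 1;
        }
    }
    return 0;
}

/* Returns a pointer to the 4-byte replacement, or NULL if there is none. */
const char *emote_get(const struct slot *t, const char *four) {
    uint32_t k = pack4(four);
    size_t i = hash32(k);
    for (size_t n = 0; n < SLOTS; n++, i = (i + 1) % SLOTS) {
        if (!t[i].used)
            return NULL;       /* an empty slot ends the probe sequence */
        if (t[i].key == k)
            return t[i].repl;
    }
    return NULL;
}
```

A zero-initialized array such as struct slot t[SLOTS] = { 0 } is an empty table. The caller must pass pointers to at least 4 readable bytes.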
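Otherwise, if there is no chance of the data changing, I strongly recommend a sorted lookup table. This is an optimized version of your first option: because the table is sorted, you can use a binary search instead of searching from the start of the array to the end. For example: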
#include <stdlib.h>   /* qsort, bsearch */
#include <string.h>   /* memcmp, strncpy */

struct replacement {
    char original[4];
    char replacement[4];
};

/* Orders entries by their 4-byte original; shared by qsort and bsearch. */
int compare_replacement(void const *x, void const *y) {
    struct replacement const *fu = x, *ba = y;
    return memcmp(fu->original, ba->original, sizeof fu->original);
}

int main(void) {
    struct replacement table[] = {
        { .original = " </3" , .replacement = "\xf0\x9f\x92\x94" },
        { .original = " <3 " , .replacement = "\xf0\x9f\x92\x97" },
        { .original = " 8-D" , .replacement = "\xf0\x9f\x98\x81" },
        { .original = " 8D " , .replacement = "\xf0\x9f\x98\x81" },
        { .original = " x-D" , .replacement = "\xf0\x9f\x98\x81" },
        { .original = " xD " , .replacement = "\xf0\x9f\x98\x81" },
        { .original = " :')" , .replacement = "\xf0\x9f\x98\x82" },
        { .original = ":'-)" , .replacement = "\xf0\x9f\x98\x82" },
        { .original = ":-))" , .replacement = "\xf0\x9f\x98\x83" },
        { .original = " 8) " , .replacement = "\xf0\x9f\x98\x84" },
        { .original = " :) " , .replacement = "\xf0\x9f\x98\x84" },
        { .original = " :-)" , .replacement = "\xf0\x9f\x98\x84" },
        { .original = " =) " , .replacement = "\xf0\x9f\x98\x84" },
        { .original = " =] " , .replacement = "\xf0\x9f\x98\x84" },
        { .original = " 0:)" , .replacement = "\xf0\x9f\x98\x87" },
        { .original = "0:-)" , .replacement = "\xf0\x9f\x98\x87" }
    };
    /* Sort once up front so that the table can be searched with bsearch. */
    qsort(table, sizeof table / sizeof *table, sizeof *table, compare_replacement);
}
You should then be able to iterate from the start of the string to the end, using bsearch to test each consecutive run of four bytes, for example:
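void replace_emotes(char *str, struct replacement *rep, size_t rep_size) {
    while (*str) {
        struct replacement query = { 0 };
        /* strncpy stops at the terminating NUL and zero-pads the rest,
           so this never reads past the end of str. */
        strncpy(query.original, str, sizeof query.original);
        struct replacement *response = bsearch(&query, rep, rep_size, sizeof *rep, compare_replacement);
        if (response) {
            /* The replacement is exactly 4 bytes with no terminator: overwrite in place. */
            memcpy(str, response->replacement, sizeof response->replacement);
            str += sizeof response->replacement;   /* skip past the emoji */
        } else {
            str++;                                 /* no match: slide the window by one byte */
        }
    }
}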
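As a usage example, the sort and the search can be combined into a single helper that can be run on a buffer. This is only a sketch: convert_line is a hypothetical name, the table is cut down to three entries to keep it short, and re-sorting on every call is done purely so that the block stands alone (a real program would sort once).

```c
#include <stdlib.h>
#include <string.h>

struct replacement {
    char original[4];
    char replacement[4];
};

static int compare_replacement(void const *x, void const *y) {
    struct replacement const *fu = x, *ba = y;
    return memcmp(fu->original, ba->original, sizeof fu->original);
}

/* Hypothetical helper: replaces emoticons in line, in place. */
void convert_line(char *line) {
    struct replacement table[] = {
        { " <3 ", "\xf0\x9f\x92\x97" },
        { " :) ", "\xf0\x9f\x98\x84" },
        { " </3", "\xf0\x9f\x92\x94" },
    };
    size_t count = sizeof table / sizeof *table;
    qsort(table, count, sizeof *table, compare_replacement);

    for (char *p = line; *p; ) {
        struct replacement query = { 0 };
        strncpy(query.original, p, sizeof query.original);  /* zero-pads near the end */
        struct replacement *hit =
            bsearch(&query, table, count, sizeof *table, compare_replacement);
        if (hit) {
            memcpy(p, hit->replacement, sizeof hit->replacement);
            p += sizeof hit->replacement;
        } else {
            p++;
        }
    }
}
```

Since each padded emoticon and its replacement are both 4 bytes, the buffer never grows or shrinks, so the replacement can be done in place. Note that the padding spaces are consumed by a match, so two emoticons that share a single space between them will not both be replaced.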
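If you intend to support insertion, you will need to use realloc to begin with... and either work out the insertion position so that the array stays sorted before you insert, or re-sort the array (for example with qsort) after each insertion. Either of these works well for modifications (insertions and deletions) on small sets, but if you plan to support larger sets you may want to use something like a trie or a hash table.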
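A sorted insertion along those lines might look like the sketch below. The struct table wrapper and the table_insert function are hypothetical names introduced for illustration. The function grows the array by one element with realloc, finds the insertion point with a binary search, and shifts the tail with memmove so that the array stays sorted and remains usable with bsearch.

```c
#include <stdlib.h>
#include <string.h>

struct replacement {
    char original[4];
    char replacement[4];
};

/* Hypothetical wrapper around a heap-allocated, sorted array. */
struct table {
    struct replacement *items;
    size_t count;
};

/* Inserts (or updates) an entry, keeping items sorted by original.
   Returns 1 on success, 0 if memory could not be allocated. */
int table_insert(struct table *t, const char *original, const char *replacement) {
    /* Binary search for the first entry that is not less than original. */
    size_t lo = 0, hi = t->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (memcmp(t->items[mid].original, original, 4) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* Key already present: just overwrite its replacement. */
    if (lo < t->count && memcmp(t->items[lo].original, original, 4) == 0) {
        memcpy(t->items[lo].replacement, replacement, 4);
        return 1;
    }
    struct replacement *grown = realloc(t->items, (t->count + 1) * sizeof *t->items);
    if (!grown)
        return 0;              /* the old array is still valid and unchanged */
    t->items = grown;
    /* Shift the tail up by one slot to open a gap at index lo. */
    memmove(&grown[lo + 1], &grown[lo], (t->count - lo) * sizeof *grown);
    memcpy(grown[lo].original, original, 4);
    memcpy(grown[lo].replacement, replacement, 4);
    t->count++;
    return 1;
}
```

Each insertion costs a memmove over part of the array, which is fine for small sets; growing by one element per call is the simplest policy rather than the fastest. An empty table is { NULL, 0 }, and the caller releases it with free(t.items).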