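I have a function that takes a sentence and tokenizes it into words, splitting on the space character " ". Now I want to improve the function so that it also strips some special characters, for example: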
I am a boy. => {I, am, a, boy}, no period after "boy"
I said :"are you ok?" => {I, said, are, you, ok}, no question mark or quotation marks
Here is the original function. How can I improve it?
void Tokenize(const string& str, vector<string>& tokens, const string& delimiters = " ")
{
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (string::npos != pos || string::npos != lastPos)
    {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
}
Answer 0 (score: 0)
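You can use std::regex. With a regular expression you can search for whatever pattern you need and put the matches into a vector. It is fairly simple.
See: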
#include <algorithm>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Test data as a raw string literal, so the embedded double quotes need no escaping
std::string testData(R"#(I said :"are you ok?")#");

// Match whole words only (runs of word characters); spaces and punctuation are skipped
std::regex re(R"#((\b\w+\b))#");

int main()
{
    // Fill the vector through its range constructor: the token iterator yields capture group 1 of every match
    std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };

    // Debug output: print every token to std::cout, separated by spaces
    std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, " "));
    return 0;
}
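
If you would rather not pull in <regex>, the Tokenize function from the question may already be enough: find_first_of and find_first_not_of treat every character of the delimiters argument as a separator, so passing punctuation along with the space should drop it from the tokens. The following is only a minimal sketch under that assumption. TokenizeWords and kWordDelimiters are hypothetical names that do not appear in the original post, and the delimiter list is a guess at which characters you want removed.

```cpp
#include <string>
#include <vector>

// Same algorithm as the Tokenize function in the question, written with std:: qualifiers.
void Tokenize(const std::string& str, std::vector<std::string>& tokens,
              const std::string& delimiters = " ")
{
    // Skip leading delimiters, then find the end of the first token.
    std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    std::string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        // Store the token, then skip the run of delimiters that follows it.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
}

// Hypothetical helper (not in the original post): tokenizes a sentence while
// treating whitespace and some common punctuation as delimiters.
std::vector<std::string> TokenizeWords(const std::string& sentence)
{
    // Assumed delimiter set; extend it with any other characters to strip.
    static const std::string kWordDelimiters = " \t\n.,:;?!\"";
    std::vector<std::string> tokens;
    Tokenize(sentence, tokens, kWordDelimiters);
    return tokens;
}
```

With this delimiter set, TokenizeWords("I am a boy.") gives {I, am, a, boy}. The apostrophe is deliberately left out of the list so that a word such as "don't" stays in one piece; add it if you want it stripped as well.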