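I am trying to filter the URLs out of a string that contains a lot of special characters, whitespace and URLs. I tried using a regex, but it fails: it sometimes manages to pick out the URLs, but the output still contains special characters and whitespace, so here I am. Best regards, P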
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>

using namespace std;

int main() {
    std::ifstream in("c:/Users/Petrus/Documents/History", std::ios::binary);
    if (!in.is_open()) {
        cout << "Failed to open" << endl;
        return 1;  // nothing to scan if the file could not be opened
    }
    cout << "Opened OK" << endl;

    // Read the whole file into one string.
    std::stringstream buffer;
    buffer << in.rdbuf();
    std::string contents(buffer.str());
    std::ofstream out("urls.txt");
    unsigned counter = 0;

    // Generic URI pattern; this is the regex that gives the unwanted output.
    std::regex word_regex(
        R"(^(([^:\/?#]+):)?(//([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)",
        std::regex::extended
    );
    auto words_begin = std::sregex_iterator(contents.begin(), contents.end(), word_regex);
    auto words_end = std::sregex_iterator();
    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;
        std::string match_str = match.str();
        // Print every sub-match, incrementing the counter once per line.
        for (const auto& res : match) {
            std::cout << ++counter << ": " << res << std::endl;
        }
        std::cout << " " << match_str << '\n';
    }
    system("PAUSE");
    return 0;
}
Answer 0 (score: 1)
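A few steps to simplify (and debug) the regex:
- Use named groups, (?<groupname>regex), to help identify what is what and to access the results.
- Replace plain capturing groups, (), that you do not need with "non-remembering" (non-capturing) groups, (?:regex); this also helps clarify what is going on.

Once that is done, it only takes a few tweaks to "fix" the regex for all of your inputs: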
(?<protocol>https?:\/\/)(?:(?<urlroot>[^\/?#\n\s]+))?(?<urlResource>[^?#\n\s]+)?(?<queryString>\?(?:[^#\n\s]*))?(?:#(?<fragment>[^\n\s]))?
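The main tweak is excluding newlines and whitespace from the negated character classes, as in [^#\n\s], so that a match stops at the first whitespace character.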
The regex demo shows the output and the match groups (truncated, but all there).