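I have an unformatted list of tweets (just copied and pasted en masse from the website), and I'm trying to put each tweet on its own line while stripping all the extraneous details from the text file.
I have a regex that works when I search with it in Notepad++, but for some reason I can't get it to work in C++.
A sample of the text I'm searching looks like this: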
Autotestdrivers.com @testdrivernews Nov 6
Tesla Model S third row of seats confuses,… http://dlvr.it/CgTbsL #children #models #police #tesla #teslamotors #Autos #Car #Trucks
1 retweet 3 likes
Gina Stark ✈ @SuuperG Nov 6
Ha! Kinda. @PowayAutoRepair I have a thing for "long-nose" cars; #Porsche #Jaguar #Ferrari , and I love the lines of a #Tesla!
View conversation
0 retweets 2 likes
Tony Morice @WestLoopAuto Nov 6
\#WeirdCarNews via @Therealautoblog Tesla Model S third row of seats confuses, delights police http://www.autoblog.com/2015/11/06/tesla-model-s-third-row-seats-police/ …
View summary
0 retweets 0 likes
The regex I'm using matches the date the tweet was posted and the tweet itself, and looks like this:
[A-Z][a-z][a-z] \d+\r\n\r\n *.+\r\n
...but for some reason I can't get it to work in my code.
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
    // Same pattern as above, with backslashes doubled for the C++ string literal.
    std::regex rgx("[A-Z][a-z][a-z] \\d+\\r\\n\\r\\n *.+\\r\\n");
    std::string Location_Of_Tweet = "put location here";
    std::smatch match;

    // Read the whole file into one string.
    std::ifstream twitterFiler;
    twitterFiler.open(Location_Of_Tweet, std::ifstream::in);
    const std::string tweetFile((std::istreambuf_iterator<char>(twitterFiler)), std::istreambuf_iterator<char>());

    if (std::regex_search(tweetFile.begin(), tweetFile.end(), match, rgx))
    {
        std::cout << "Match\n";
        for (auto m : match)
            std::cout << " submatch " << m << '\n';
    }
    else
        std::cout << "No match\n";
}
Answer 0 (score: 1)
This regex assumes that the C++11 regex engine understands the horizontal-whitespace escape \h. If it doesn't, replace every \h with [^\S\r\n].
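The substitution described above can be applied mechanically to any pattern string. Here is a minimal sketch; `expand_horizontal_space` is a hypothetical helper name, not part of any library, and it does a plain text replacement without parsing the regex.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replaces each "\h" escape in a regex pattern with
// the class "[^\S\r\n]" (whitespace that is not a line break).
// Simplification: plain text replacement, so a literal backslash followed
// by 'h' (written "\\h" in regex syntax) would be rewritten too.
std::string expand_horizontal_space(std::string pattern) {
    const std::string from = "\\h";         // the two characters \ and h
    const std::string to = "[^\\S\\r\\n]";  // the class [^\S\r\n]
    std::string::size_type pos = 0;
    while ((pos = pattern.find(from, pos)) != std::string::npos) {
        pattern.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return pattern;
}
```

Calling it on a pattern that contains no \h returns the pattern unchanged.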
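How it works is easy to explain (see the breakdown below). However, you would need a more substantial delimiter to separate the tweets reliably.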
"(?m)([A-Z][a-z][a-z]\\h+\\d+)\\h*\\r?\\n\\s*^\\h*(?=\\S)(.+)"
Explanation
(?m) # Multi-line mode
( [A-Z] [a-z] [a-z] \h+ \d+ ) # (1), Date
\h* \r? \n \s* # Line break, any number of whitespace
^ \h* # Beginning of line
(?= \S ) # Next, first non-whitespace
( .+ ) # (2), Tweet
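The breakdown above can be tried with std::regex roughly as follows. This is a sketch under several assumptions, not a drop-in version of the pattern: as far as I know, the default ECMAScript grammar of std::regex accepts neither the inline (?m) flag nor \h, so the sketch drops (?m) and the ^ anchor (the preceding \s* already consumes the line break and any indentation) and writes horizontal whitespace as [ \t]. It also uses [^\r\n]+ in place of .+ so the captured tweet never includes a trailing \r. The function name `extract_tweets` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns (date, tweet) pairs found in `text`.
// Assumption: a tweet is the first non-blank line after a "Mon D" date
// that ends its own line, as in the pattern explained above.
std::vector<std::pair<std::string, std::string>> extract_tweets(const std::string& text) {
    static const std::regex rgx(
        "([A-Z][a-z][a-z][ \\t]+\\d+)"  // (1) date, e.g. "Nov 6"
        "[ \\t]*\\r?\\n\\s*"            // line break, then any whitespace
        "(?=\\S)([^\\r\\n]+)");         // (2) tweet: rest of that line
    std::vector<std::pair<std::string, std::string>> out;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it) {
        out.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return out;
}
```

Writing each pair's second element followed by a newline gives one tweet per line. Because \r?\n tolerates both line-ending styles, the same pattern works whether or not the file was read with its carriage returns intact.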
Output, using your sample as the test case:
** Grp 1 - ( pos 37 , len 5 )
Nov 6
** Grp 2 - ( pos 46 , len 132 )
Tesla Model S third row of seats confuses,… http://dlvr.it/CgTbsL #children #models #police #tesla #teslamotors #Autos #Car #Trucks
-----------------
** Grp 1 - ( pos 226 , len 5 )
Nov 6
** Grp 2 - ( pos 235 , len 126 )
Ha! Kinda. @PowayAutoRepair I have a thing for "long-nose" cars; #Porsche #Jaguar #Ferrari , and I love the lines of a #Tesla!
-----------------
** Grp 1 - ( pos 435 , len 5 )
Nov 6
** Grp 2 - ( pos 444 , len 170 )
\#WeirdCarNews via @Therealautoblog Tesla Model S third row of seats confuses, delights police http://www.autoblog.com/2015/11/06/tesla-model-s-third-row-seats-police/ …
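A listing in a similar layout can be produced from C++ with `position()` and `length()` on each match. This is only a sketch: `report_groups` is a hypothetical name, the pattern is the std::regex-friendly adaptation ([ \t] for \h, no (?m) or ^, [^\r\n]+ for .+), and the positions it prints depend on the exact bytes of the input, so they will not necessarily equal the numbers listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>

// Formats every match as "** Grp N - ( pos P , len L )" followed by the
// captured text, with a dashed separator after each match.
std::string report_groups(const std::string& text) {
    static const std::regex rgx(
        "([A-Z][a-z][a-z][ \\t]+\\d+)[ \\t]*\\r?\\n\\s*(?=\\S)([^\\r\\n]+)");
    std::ostringstream out;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it) {
        for (std::size_t g = 1; g <= 2; ++g) {
            // position(g) is the offset of group g from the start of `text`.
            out << "** Grp " << g << " - ( pos " << it->position(g)
                << " , len " << it->length(g) << " )\n"
                << (*it)[g].str() << '\n';
        }
        out << "-----------------\n";
    }
    return out.str();
}
```

Streaming the returned string to std::cout prints one block per tweet found.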