This is a follow-up to this previous topic. I have a Series named "British" like this:
British
\bSkilful\b
\bWilful\b
\bfulfil\b
\b.*favour.*\b
\bappal\b
\bappall.*\b
\barbour.*\b
\barmor.*\b
\bstrange\b
\brumor.*\b
\b.*color.*\b
\b.*centre's\b
and a DataFrame df like this:
User_ID Tweet
01 hi all
02 see you something
03 that's my favourite spot
04 the strangest rumors
05 my appal is nice
06 check my rumor
07 #brborboncheckruMoreThanever
08 look @mycentre's
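For reproducibility, the two inputs above can be rebuilt in code roughly as follows. This is only a minimal sketch: in my case the patterns are read from a file, the variable name `british` is made up for this illustration, and I am assuming `User_ID` is stored as an integer.

```python
import pandas as pd

# Patterns copied from the "British" list above (raw strings keep the backslashes).
british = pd.Series(
    [
        r"\bSkilful\b", r"\bWilful\b", r"\bfulfil\b", r"\b.*favour.*\b",
        r"\bappal\b", r"\bappall.*\b", r"\barbour.*\b", r"\barmor.*\b",
        r"\bstrange\b", r"\brumor.*\b", r"\b.*color.*\b", r"\b.*centre's\b",
    ],
    name="British",
)

# Sample tweets copied from the table above.
# Assumption: User_ID is an integer column (1..8).
df = pd.DataFrame(
    {
        "User_ID": list(range(1, 9)),
        "Tweet": [
            "hi all",
            "see you something",
            "that's my favourite spot",
            "the strangest rumors",
            "my appal is nice",
            "check my rumor",
            "#brborboncheckruMoreThanever",
            "look @mycentre's",
        ],
    }
)
```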
I want to get a new column containing the SINGLE keyword found in the string. So far I have done:
import re
import pandas as pd

List = pd.read_csv('w.txt')
r = re.compile(r'.*({}).*'.format('|'.join(List['British'].values)), re.IGNORECASE)
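Then I mask the DataFrame:
masked = list(map(bool, map(r.search, df['Tweet'])))  # list() so this also works on Python 3
df2 = df[masked].copy()  # copy so a column can be added without a SettingWithCopyWarning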
Then I mask it again to add the 'keyword' column:
mask = [m.group(1) if m else None for m in map(r.search, df2['Tweet'])]
df2['keyword'] = mask
which returns:
User_ID Tweet keyword
2 3 that's my favourite spot favourite spot
4 5 my appal is nice appal
5 6 check my rumor rumor
7 8 look @mycentre's mycentre's
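So the boolean mask works fine and detects only the tweets that contain at least one keyword. But what if I want to extract only the single keyword that was found? The final DataFrame should be: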
User_ID Tweet keyword
2 3 that's my favourite spot favourite
4 5 my appal is nice appal
5 6 check my rumor rumor
7 8 look @mycentre's centre's
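As far as I can tell, the extra words come from the `.*` parts of the patterns, which can run across spaces. Here is a small illustration on one tweet. Swapping `.*` for `\w*` is only my assumption about what might be acceptable for these patterns, and `first_match` is a helper name made up for this sketch.

```python
import re


def first_match(pattern, text, group=0):
    """Return the requested group of the first match of `pattern` in `text`, or None."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.group(group) if m else None


tweet = "that's my favourite spot"

# Same shape as my compiled regex: the pattern wrapped in .*( ... ).*
wrapped = first_match(r".*(\b.*favour.*\b).*", tweet, group=1)
print(wrapped)    # favourite spot

# The bare pattern: the greedy .* spans the whole tweet.
greedy = first_match(r"\b.*favour.*\b", tweet)
print(greedy)     # that's my favourite spot

# Assumption: \w* instead of .* keeps the match inside a single word.
word_only = first_match(r"\b\w*favour\w*\b", tweet)
print(word_only)  # favourite

# The same substitution applied to the centre's pattern.
handle = first_match(r"\b\w*centre's\b", "look @mycentre's")
print(handle)     # mycentre's
```

This gets `favourite` on its own, but the last call still returns `mycentre's` rather than the `centre's` I am after, so it is not a complete solution.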
Thank you very much for your help.