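I am working with Twitter data in a DataFrame. I want to filter the column that holds each tweet's text by a keyword found in that text.
I have already tried str.contains, but that does not work because the column is a Series. I want to filter the "text" column for all tweets that contain the keyword "remoaners".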
remoaners_only = time_plus_text[time_plus_text["text"].str.contains("remoaners", case=False, na=False)]
This produces either an empty DataFrame or a lot of NaN values.
The pandas version is 0.24.1.
Here is the input data, from time_plus_text["text"].head(10):
0 [ #bbcqt Remoaners on about post Brexit racial...
1 [@sarahwollaston Shut up, you like all remoane...
2 [ what have the Brextremists ever done for us ...
3 [ Remoaner in bizarre outburst ]
4 [ Anyone who disagrees with brexit is called n...
5 [ @SkyNewsBreak They forecasted if the vote wa...
6 [ but we ARE LEAVING THE #EU, even the #TORIES...
7 [ Can unelected Remoaner peers not see how abs...
8 [@sizjam68 @LeaveEUOfficial @johnredwood It wo...
9 [ Hey @BBC have you explained why when award w...
Name: text, dtype: object
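Answer 0 (score: 0)
Your code works. So you need to check your input data, or the difference between the pandas bug-fix releases 0.24.1 and 0.24.2. Here is what I get, followed by the code that produced it: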
0.24.2
index text
0 0 [ #bbcqt Remoaners on about post Brexit rac...
import pandas as pd
import sys
# StringIO lives in a different module on Python 2 and Python 3
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
print(pd.__version__)
csvdata = StringIO("""0, [ #bbcqt Remoaners on about post Brexit racial...
1, [@sarahwollaston Shut up, you like all remoane...
2, [ what have the Brextremists ever done for us ...
3, [ Remoaner in bizarre outburst ]
4, [ Anyone who disagrees with brexit is called n...
5, [ @SkyNewsBreak They forecasted if the vote wa...
6, [ but we ARE LEAVING THE #EU, even the #TORIES...
7, [ Can unelected Remoaner peers not see how abs...
8, [@sizjam68 @LeaveEUOfficial @johnredwood It wo...
9, [ Hey @BBC have you explained why when award w...""")
df = pd.read_csv(csvdata, names=["index", "text"], sep=",")
result = df[df["text"].str.contains("remoaners", case=False, na=False)]
# results
print(result)
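One way to check the input data, as suggested above, is to look at the Python type of each cell in the column. This is only a minimal sketch: the two-row frame is a made-up sample that assumes each cell holds a list wrapping a single string, which is what the square brackets in the question's head(10) output hint at.

```python
import pandas as pd

# Hypothetical sample: each cell is assumed to be a list holding one string.
time_plus_text = pd.DataFrame({
    "text": [
        ["#bbcqt Remoaners on about post Brexit racial..."],
        ["what have the Brextremists ever done for us ..."],
    ]
})

# Count the Python type of every cell in the column.
cell_types = time_plus_text["text"].map(type).value_counts()
print(cell_types)

# True only when every cell is a plain str, which is what str.contains needs.
all_strings = time_plus_text["text"].map(lambda v: isinstance(v, str)).all()
print(all_strings)
```

If the count shows list rather than str, the filter is being applied to the lists themselves and not to the tweet text inside them.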
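Answer 1 (score: 0)
The problem is that the string you are searching for the substring remoaners is wrapped in a list in every cell. You first need to use str[0] to get at that string, and only then call str.contains, for example: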
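import pandas as pd

# input: every cell of "text" is a list holding a single string
time_plus_text = pd.DataFrame({'text': [['#bbcqt Remoaners on about post Brexit racial...'],
                                        ['@sarahwollaston Shut up, you like all remoaners...'],
                                        ['what have the Brextremists ever done for us ...']]})
# str[0] takes the first element of each list, then str.contains searches it
print(time_plus_text["text"].str[0].str.contains("remoaners", case=False, na=False))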
0 True
1 True
2 False
Name: text, dtype: bool
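For comparison, here is a small sketch of the same call with and without str[0] on the sample frame above. As far as I can tell, str.contains treats a cell that is not a string as missing, so na=False turns every list cell into False, which would explain the empty DataFrame in the question. The names direct and unwrapped are mine, just for illustration.

```python
import pandas as pd

time_plus_text = pd.DataFrame({
    "text": [
        ["#bbcqt Remoaners on about post Brexit racial..."],
        ["@sarahwollaston Shut up, you like all remoaners..."],
        ["what have the Brextremists ever done for us ..."],
    ]
})

# Without str[0]: the cells are lists, not strings, so nothing matches.
direct = time_plus_text["text"].str.contains("remoaners", case=False, na=False)

# With str[0]: the first element of each list (a string) is searched.
unwrapped = time_plus_text["text"].str[0].str.contains("remoaners", case=False, na=False)

print(direct.tolist())
print(unwrapped.tolist())
```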
So this is what you should do:
remoaners_only = time_plus_text[time_plus_text["text"].str[0]\
.str.contains("remoaners", case=False, na=False)]
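Note that str[0] only looks at the first element of each list. If some cells could hold several strings, or an empty list, one possible variant is to join each list into a single string with str.join before searching. This is only a sketch on made-up data; whether your real column ever contains such cells is an assumption on my part.

```python
import pandas as pd

# Hypothetical data: lists with one, two, or zero strings per cell.
time_plus_text = pd.DataFrame({
    "text": [
        ["#bbcqt Remoaners on about post Brexit racial..."],
        ["first part of a tweet", "all remoaners in the second part"],
        [],
    ]
})

# str[0] inspects only the first element, so the second row is missed.
first_only = time_plus_text["text"].str[0].str.contains(
    "remoaners", case=False, na=False
)

# str.join concatenates every element of each list before searching.
joined = time_plus_text["text"].str.join(" ").str.contains(
    "remoaners", case=False, na=False
)

remoaners_only = time_plus_text[joined]
print(first_only.tolist())
print(joined.tolist())
print(len(remoaners_only))
```

When every list holds exactly one string, both variants should select the same rows, so the str[0] form above is enough for the data shown in the question.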