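I am trying to select the rows whose "story" column contains any of the strings in my list selected_words.
I have tried several options, including isin and str.contains, but I usually get either an error or an empty DataFrame.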
import pandas as pd

df4 = pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")
df4["story"] = df4["story"].astype(str)
selected_words = [
    'accept', 'believe', 'trust', 'accepted', 'accepts',
    'trusts', 'believes', 'acceptance', 'trusted', 'trusting', 'accepting',
    'believes', 'believing', 'believed', 'normal', 'normalize', ' normalized',
    'routine', 'belief', 'faith', 'confidence', 'adoption',
    'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves',
]
# At this point I am lost as to what to do next
Depending on what I try, I get either an empty DataFrame or an error message.
Answer 0 (score: 1)
Try this. I couldn't load your DataFrame.
df4[df4["story"].isin(selected_words)]
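One thing worth checking with this approach: isin tests whether the whole cell equals one of the listed values, so it only finds rows where "story" is exactly one of the words. That may explain an empty result on longer text. A minimal sketch on made-up data (the toy DataFrame and the shortened word list are assumptions for illustration, not the asker's file):

```python
import pandas as pd

# Hypothetical toy data standing in for the "story" column.
df = pd.DataFrame({"story": ["trust", "I trust this exchange", "prices fell again"]})
selected_words = ["accept", "trust"]

# isin: True only where the entire cell equals one of the words.
exact = df[df["story"].isin(selected_words)]

# str.contains: True where any of the words occurs somewhere in the cell.
partial = df[df["story"].str.contains("|".join(selected_words))]

print(len(exact), len(partial))
```

Here isin keeps only the row whose story is exactly "trust", while str.contains also keeps the longer sentence.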
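Answer 1 (score: 0)
You can see the solution here: https://stackoverflow.com/a/26577689/12322720
Basically, str.contains supports regular expressions, so you can join the words with the regex OR operator, the pipe (|).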
df4[df4.story.str.contains('|'.join(selected_words))]
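Because the joined string is treated as a regular expression, words containing characters such as "$", "(" or "." would be interpreted as regex syntax. A hedged variant, assuming you want literal, case-insensitive matching and want missing values treated as non-matches (the toy data and word list below are made up for illustration):

```python
import re

import pandas as pd

# Hypothetical toy data, including a missing value.
df = pd.DataFrame(
    {"story": ["We ACCEPT bitcoin", "cost: $5 (approx.)", None, "nothing relevant"]}
)
selected_words = ["accept", "$5 (approx.)"]

# Escape each word so regex metacharacters are matched literally.
pattern = "|".join(re.escape(word) for word in selected_words)

# case=False ignores letter case; na=False maps missing values to False
# so the result is a clean boolean mask.
mask = df["story"].str.contains(pattern, case=False, na=False)
result = df[mask]
print(result)
```

If the words are plain lowercase letters, as in the question, the escaping changes nothing, but it is a cheap safeguard.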
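Answer 2 (score: 0)
I am currently learning more pandas myself, so I wanted to contribute an answer I just learned from a book.
You can build a boolean "mask" as a pandas Series and use it to filter the DataFrame.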
import pandas as pd
# This URL doesn't return CSV.
CSV_URL = "https://drive.google.com/open?id=1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW"
# Data file saved from within a browser to help with question.
# I stored the BitcoinData.csv data on my Minio server.
df = pd.read_csv("https://minio.apps.selfip.com/mymedia/csv/BitcoinData.csv")
selected_words = [
"accept",
"believe",
"trust",
"accepted",
"accepts",
"trusts",
"believes",
"acceptance",
"trusted",
"trusting",
"accepting",
"believes",
"believing",
"believed",
"normal",
"normalize",
" normalized",
"routine",
"belief",
"faith",
"confidence",
"adoption",
"adopt",
"adopted",
"embrace",
"approve",
"approval",
"approved",
"approves",
]
# %%timeit run in Jupyter notebook
mask = pd.Series(any(word in item for word in selected_words) for item in df["story"])
# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# %%timeit run in Jupyter notebook
df[mask]
# results: 955 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# %%timeit run in Jupyter notebook
df[df.story.str.contains('|'.join(selected_words))]
# results 129 ms ± 738 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Compare the two results element-wise; every cell came out True.
df[mask] == df[df.story.str.contains('|'.join(selected_words))]
# The mask can also be computed inline in the indexing operation, but then it is
# recomputed on every call instead of reusing the stored mask.
# %%timeit run in Jupyter notebook
df[[any(word in item for word in selected_words) for item in df["story"]]]
# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# This is still faster than using the alternative `df.story.str.contains`
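The mask-based approach is noticeably faster: in the timings above, building the mask took about 18 ms, versus about 129 ms for str.contains.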
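One caveat with a mask built by pd.Series(...) from a plain iterable: it gets a default 0, 1, 2, ... index, and boolean indexing with a Series aligns on index labels. If the DataFrame's index is something else (for example after earlier filtering), passing index=df.index keeps the two aligned. A small sketch on made-up data (the toy DataFrame and its index values are assumptions):

```python
import pandas as pd

# Hypothetical toy data with a non-default index.
df = pd.DataFrame(
    {"story": ["we accept it", "no match here", "a matter of faith"]},
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)
selected_words = ["accept", "faith"]

# Build the mask with plain Python substring tests, reusing the
# DataFrame's own index so label alignment works.
mask = pd.Series(
    [any(word in item for word in selected_words) for item in df["story"]],
    index=df.index,
)
result = df[mask]
print(result)
```

Note also that `word in item` is a plain substring test, so "normal" matches "abnormal" as well, just as the unanchored regex does.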