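My ultimate goal is to create a for loop that iterates over multiple files, plus an additional for loop that compares a conditional index against a dataframe. To make this more interesting, I am also including a function, since I may have to apply the same principle to another variable in the same dataframe. There are a few issues.
Should I use regex in this case, or is a simple `in` statement enough?
As for the `isin` statement, each word in the list needs to be checked against a row of the dataframe. However, I am not sure how to apply it when attempting this sort of thing...
df: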
'headline'                                           'source'
targets is making better stars in the bucks          target news
more diamonds than rocks in saturn rings             wishful thinking
diamond in the rough employees take too many naps    refresh sleep
data:
'company'
targets
stars in the bucks
wallymarty
velocity global
diamond in the rough
ccompanies = data['company'].tolist()  # convert into list

def find(x):  # function to compare df['headline'] against list of companies
    result = []
    companies = set(ccompanies)  # edit based on comment, saves time
    for i in companies:
        if i in x:
            result.append(x)
    return result

matches = df['headline'].apply(find)
The desired output would be a list of the headlines that match a company:
targets is making better stars in the bucks
diamond in the rough employees take too many naps
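Edit: my script has been edited, so it now runs and shows the headlines. However, the output does not contain only the desired headlines: it shows every row of the dataframe, with only the applicable rows filled in.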
Answer 0 (score: 1)
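... should I use regex in this case, or is a simple `in` statement enough?
Using `in` is fine, since you have apparently already normalized with `.lower()` and removed punctuation.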
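To make the trade-off concrete, here is a small sketch comparing the two approaches on the sample headlines. The fourth headline is invented for illustration and is not part of the original data. It shows one case where the two differ: a plain `in` test also matches inside longer words, while a regex with `\b` word boundaries does not.

```python
import re
import pandas as pd

headlines = pd.Series([
    "targets is making better stars in the bucks",
    "more diamonds than rocks in saturn rings",
    "diamond in the rough employees take too many naps",
    "moving targetship north",  # hypothetical extra row, not in the original data
])
companies = ["targets", "stars in the bucks", "wallymarty",
             "velocity global", "diamond in the rough"]

# Plain substring test: True if any company name occurs anywhere in the text.
substring_hits = headlines.apply(
    lambda headline: any(company in headline for company in companies)
)

# Regex alternative: escape each name and require word boundaries around it.
pattern = r"\b(?:" + "|".join(re.escape(company) for company in companies) + r")\b"
regex_hits = headlines.str.contains(pattern, regex=True)

print(pd.DataFrame({"substring": substring_hits, "regex": regex_hits}))
```

On the three original headlines both columns agree, which supports the point that `in` is enough for data that is already normalized. The regex only matters if partial-word matches would be a problem.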
You really should try to use more meaningful identifiers. For example, instead of `i`, the idiomatic spelling would be `for company in companies:`.
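You have already figured out how to use `.tolist()`, which is good. But you really want to create a `set` rather than a `list`, to support efficient `in` tests. For a membership test against the collection, that is the difference between an O(1) hash lookup and a linear scan of the list inside a nested loop. Note that a substring test such as `company in headline` still has to visit every company, so in that case the set mainly removes duplicates.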
This makes no sense:
for i in ccompanies:
    i = [x]
You start to iterate, but then `i` is essentially a constant? It is unclear what you were going for.
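Putting those suggestions together, one possible shape for the function is sketched below. The name `mentions_company` is made up for this example, and the sample frames are rebuilt from the question. Instead of returning a list per row, the function returns a boolean, which is then used as a mask so that only matching headlines are kept.

```python
import pandas as pd

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
data = pd.DataFrame({"company": [
    "targets", "stars in the bucks", "wallymarty",
    "velocity global", "diamond in the rough",
]})

# Build the set once, outside the function, so it is not rebuilt for every row.
companies = set(data["company"])

def mentions_company(headline):
    """Return True if any company name occurs as a substring of the headline."""
    return any(company in headline for company in companies)

# A boolean mask keeps only the matching rows, instead of one result per row.
mask = df["headline"].apply(mentions_company)
matches = df.loc[mask, "headline"].tolist()
print(matches)
```

Filtering with the mask is what avoids the "all rows shown, only some filled" output described in the question's edit.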
If you take this project a bit further, you may want to consider matching companies with NLTK, or with TfidfVectorizer from scikit-learn, or with https://pypi.org/project/fuzzywuzzy/
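As a rough illustration of what approximate matching buys you, here is a sketch that uses `difflib` from the Python standard library as a stand-in. It is not any of the libraries named above. The helper name `closest_company` and the 0.8 cutoff are assumptions chosen for this example only.

```python
import difflib

companies = ["targets", "stars in the bucks", "wallymarty",
             "velocity global", "diamond in the rough"]

def closest_company(name, cutoff=0.8):
    """Return the most similar known company name, or None if nothing is close."""
    candidates = difflib.get_close_matches(name, companies, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

print(closest_company("wallymart"))   # a misspelled name still finds its company
print(closest_company("zzz"))         # nothing similar, so None
```

Unlike `in`, this tolerates small spelling differences, at the cost of having to pick a similarity threshold.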
Answer 1 (score: 0)
In pure pandas, there is no need to iterate or to convert to a list.
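First, join `data` with `df` so that each headline is "duplicated" once for every company name it has to be compared against. A temporary column "key" is used to make this join possible.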
In [60]: data_df = data.to_frame()  # assumes data is a Series; for a DataFrame, use data[['company']].copy()
In [61]: data_df['key'] = 1
In [63]: df['key'] = 1
In [65]: merged = pd.merge(df, data_df, how='outer', on='key').drop('key', axis=1)
`merged` will look like the following. As you can see, depending on the size of `data`, this approach can produce a huge DataFrame.
In [66]: merged
Out[66]:
    headline                                           source            company
0   targets is making better stars in the bucks        target news       targets
1   targets is making better stars in the bucks        target news       stars in the bucks
2   targets is making better stars in the bucks        target news       wallymarty
3   targets is making better stars in the bucks        target news       velocity global
4   targets is making better stars in the bucks        target news       diamond in the rough
5   more diamonds than rocks in saturn rings           wishful thinking  targets
6   more diamonds than rocks in saturn rings           wishful thinking  stars in the bucks
7   more diamonds than rocks in saturn rings           wishful thinking  wallymarty
8   more diamonds than rocks in saturn rings           wishful thinking  velocity global
9   more diamonds than rocks in saturn rings           wishful thinking  diamond in the rough
10  diamond in the rough employees take too many naps  refresh sleep     targets
11  diamond in the rough employees take too many naps  refresh sleep     stars in the bucks
12  diamond in the rough employees take too many naps  refresh sleep     wallymarty
13  diamond in the rough employees take too many naps  refresh sleep     velocity global
14  diamond in the rough employees take too many naps  refresh sleep     diamond in the rough
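The temporary-key trick builds a cross join by hand. Assuming pandas 1.2 or newer, the same product can be requested directly with `how="cross"`, which removes the need for the helper column. The sketch below rebuilds the sample frames from the question, with `data` as a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
data = pd.DataFrame({"company": [
    "targets", "stars in the bucks", "wallymarty",
    "velocity global", "diamond in the rough",
]})

# Every headline row is paired with every company row: 3 x 5 = 15 rows.
merged = df.merge(data, how="cross")
print(merged.shape)
```

The size warning still applies: the result always has `len(df) * len(data)` rows, however the join is spelled.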
Then look for the company text in the headline. If it is found, put True in a new "found" column, otherwise False.
In [67]: merged['found'] = merged.apply(lambda x: x['company'] in x['headline'], axis=1)
Then drop the rows where no match was found:
In [68]: found_df = merged.drop(merged[merged['found']==False].index)
In [69]: found_df
Out[69]:
    headline                                           source         company               found
0   targets is making better stars in the bucks        target news    targets               True
1   targets is making better stars in the bucks        target news    stars in the bucks    True
14  diamond in the rough employees take too many naps  refresh sleep  diamond in the rough  True
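As a side note, the `drop(...index)` step can also be written as a plain boolean mask, which some readers find easier to follow. The sketch below is one way to do that, not the only one. It rebuilds the sample data, uses a default inner join on the constant key (which yields the same rows here), and computes the "found" column with `zip` instead of a row-wise `apply`.

```python
import pandas as pd

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
data = pd.DataFrame({"company": [
    "targets", "stars in the bucks", "wallymarty",
    "velocity global", "diamond in the rough",
]})

# Temporary-key join: every headline paired with every company.
merged = df.assign(key=1).merge(data.assign(key=1), on="key").drop(columns="key")

# Substring test per row, pairing the two columns with zip.
merged["found"] = [
    company in headline
    for company, headline in zip(merged["company"], merged["headline"])
]

# Keep the matching rows directly; no need to collect an index and drop it.
found_df = merged[merged["found"]]
print(found_df[["headline", "company"]])
```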
If necessary, reduce the result to just the headline and the company:
In [70]: found_df[['headline', 'company']]
Out[70]:
    headline                                           company
0   targets is making better stars in the bucks        targets
1   targets is making better stars in the bucks        stars in the bucks
14  diamond in the rough employees take too many naps  diamond in the rough
Shortcut: steps 67 through the end can be condensed into this one command:
merged.drop(merged[merged.apply(lambda x: x['company'] in x['headline'], axis=1) == False].index)[['headline', 'company']]
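Since the question mentions reusing the same logic on another column of the same dataframe, the steps can also be wrapped in a function. The sketch below is one possible arrangement. The name `match_companies` and its `column` parameter are invented for this example, and it assumes the text column is not itself called "company" or "key".

```python
import pandas as pd

def match_companies(df, companies, column="headline"):
    """Return (column, company) rows where the company name occurs in the text."""
    left = df[[column]].assign(key=1)
    right = pd.DataFrame({"company": list(companies), "key": 1})
    merged = left.merge(right, on="key").drop(columns="key")
    found = pd.Series(
        [company in text for company, text in zip(merged["company"], merged[column])],
        index=merged.index,
        dtype=bool,
    )
    return merged[found]

df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
companies = ["targets", "stars in the bucks", "wallymarty",
             "velocity global", "diamond in the rough"]

result = match_companies(df, companies)
# A headline matching two companies appears twice, so de-duplicate for the final list.
unique_headlines = result["headline"].drop_duplicates().tolist()
print(unique_headlines)

# The same function applied to a different column of the same dataframe.
print(match_companies(df, companies, column="source"))
```

The `drop_duplicates()` step is what turns the headline/company pairs back into the plain list of headlines the question asked for.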