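I am going through a dataframe column of headlines (sp500news) and comparing it against a dataframe of company names (co_names_df). Whenever a company name appears in a headline, I am trying to update its frequency.
My current code is below, and it does not update the Frequency column. Is there a cleaner, faster implementation, perhaps one without for loops?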
for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1
co_names_df sample
Name Frequency
0 3M 0
1 A.O. Smith 0
2 Abbott 0
3 AbbVie 0
4 Accenture 0
5 Activision 0
6 Acuity Brands 0
7 Adobe Systems 0
...
sp500news['title'] sample
title
0 Italy will not dismantle Montis labour reform minister
1 Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4 Xis campaign to draw people back to graying rural China faces uphill battle
6 Romney begins to win over conservatives
8 Oregon mall shooting survivor in serious condition
9 Polands PGNiG to sign another deal for LNG supplies from US CEO
Answer 0 (score: 1)
You can probably speed this up; you are using dataframes where other structures would work better. Here is what I would try.
from collections import Counter

counts = Counter()
# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])
for title in sp500news['title']:
    # the sample titles are plain strings, so split them into words;
    # drop .split() if each title is already a list of strings
    for word in title.split():
        if word in company_names:
            counts.update([word])
counts
counts is then a dictionary of {company_name: count}. You can do a quick loop over its items to update the counts in the dataframe.
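As a rough sketch of that last step (not part of the original answer), the counts can be written back with `Series.map` instead of an explicit loop. The two small frames below are made-up stand-ins for the real data, and `word_freq`, `phrase_freq`, and `count_phrase` are hypothetical names. The second half swaps in a different technique, regex phrase counting with `Series.str.count`, on the assumption that multi-word names such as "Acuity Brands" should also match, which whitespace splitting cannot do.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the real frames.
co_names_df = pd.DataFrame(
    {"Name": ["3M", "Abbott", "Acuity Brands"], "Frequency": [0, 0, 0]}
)
sp500news = pd.DataFrame(
    {
        "title": [
            "3M raises outlook as Abbott lags",
            "Acuity Brands shares jump",
            "Abbott wins approval",
        ]
    }
)

# Option 1: single-word matching, as in the answer above.
company_names = set(co_names_df["Name"])
counts = Counter(
    word
    for title in sp500news["title"]
    for word in title.split()
    if word in company_names
)
# Look up each name in the Counter; names never seen end up as 0.
word_freq = co_names_df["Name"].map(counts).fillna(0).astype(int)


# Option 2 (assumption: multi-word names should match too).
def count_phrase(name):
    """Count occurrences of `name` as a whole phrase across all titles."""
    # Lookarounds keep "3M" from matching inside a longer token such as "3Mx".
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return int(sp500news["title"].str.count(pattern).sum())


phrase_freq = co_names_df["Name"].map(count_phrase)

# Store whichever result fits the data; here the phrase-based one.
co_names_df["Frequency"] = phrase_freq
```

On this toy data the word-based count misses "Acuity Brands" while the phrase-based count finds it. Option 2 still loops over the names, but each pass is vectorized over all titles. As for why the loop in the question never updates anything: `for string in title` walks over single characters when `title` is a string, and `co_names_df['Name']=='string'` compares against the literal text 'string' rather than the variable.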