Question

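I am going through a dataframe column of headlines (sp500news) and comparing it against a dataframe of company names (co_names_df). Whenever a company name appears in a headline, I try to update its frequency.

My current code is below, and it does not update the frequency column. Is there a cleaner, faster implementation, perhaps without for loops?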

for title in sp500news['title']:
    for string in title:
        for co_name in co_names_df['Name']:
            if string == co_name:
                co_names_index = co_names_df.loc[co_names_df['Name']=='string'].index
                co_names_df['Frequency'][co_names_index] += 1

co_names_df example

    Name    Frequency
0   3M  0
1   A.O. Smith  0
2   Abbott  0
3   AbbVie  0
4   Accenture   0
5   Activision  0
6   Acuity Brands   0
7   Adobe Systems   0                 
               ...

sp500news['title'] example

title  
0       Italy will not dismantle Montis labour reform  minister                            
1       Exclusive US agency FinCEN rejected veterans in bid to hire lawyers                
4       Xis campaign to draw people back to graying rural China faces uphill battle        
6       Romney begins to win over conservatives                                            
8       Oregon mall shooting survivor in serious condition                                 
9       Polands PGNiG to sign another deal for LNG supplies from US CEO

Answer 1

You can probably speed this up; you are using dataframes where other structures would work better. Here is what I would try.

from collections import Counter

counts = Counter()

# checking membership in a set is very fast (O(1))
company_names = set(co_names_df["Name"])

for title in sp500news['title']:
    for word in title: # did you mean title.split(" ")? or is title a list of strings?
        if word in company_names:
            counts.update([word])

counts is then a dictionary of the form {company_name: count}. You can do a quick loop over its elements to update the counts in the dataframe.
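
As a rough sketch of that last step, the counts can be written back into the Frequency column with a lookup per name instead of an explicit loop over rows. The sample data below is made up for illustration, and the code assumes each title is a plain string that can be split on whitespace, which the question does not confirm.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the question's dataframes.
co_names_df = pd.DataFrame(
    {"Name": ["3M", "Abbott", "Adobe Systems"], "Frequency": 0}
)
sp500news = pd.DataFrame(
    {
        "title": [
            "3M raises forecast as Abbott lags",
            "Abbott wins approval for new device",
            "Romney begins to win over conservatives",
        ]
    }
)

company_names = set(co_names_df["Name"])
counts = Counter()
for title in sp500news["title"]:
    # Assumption: titles are plain strings, so split them into words.
    counts.update(word for word in title.split() if word in company_names)

# A Counter returns 0 for keys it has never seen, so unmatched names get 0.
co_names_df["Frequency"] = co_names_df["Name"].map(lambda name: counts[name])

print(co_names_df)
```

Note that word-by-word matching can never match a multi-word name such as "Adobe Systems", so those rows stay at 0 here; counting them would need a substring or phrase search instead.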
