Question

I have a DataFrame in the following format (the real one has about 200,000 rows; roughly 20% of them are active, 'Y', and the rest are 'N'):

active  adtype              body   eng         first scan  id
N       Private Seller Car  Coupe  8cyl 4.7L   31/01/2016  SSE-AD-3469148
Y       Dealer: Near New    Coupe  12cyl 6.5L  31/01/2016  OAG-AD-12326299
N       Dealer: Used Car    Coupe  12cyl 6.5L  31/01/2016  OAG-AD-6834787

I build a list of IDs and then cross-check it against data scraped from some websites to find new items:

database_ids = database_records['id'].tolist() #simple list of ad IDs from CSV
database_ids = set(database_ids)
database_dicts = database_records.to_dict(orient='records') #Converted to list of dicts  
newads = []
adscrape_ids = []

#Search database for existing ads. Append new ads to 'newads'
for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print('new ad:', ad)
        newads.append(ad)

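I'd like to speed this up by restricting database_ids to only the IDs that are active ('Y'). Is there an efficient pandas-specific way to do this, or should I write a loop like this: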

database_ids = []  # start from an empty list; a set has no append()
for row in database_dicts:
    if row['active'] == 'Y':
        database_ids.append(row['id'])
database_ids = set(database_ids)

Answer 1

You can do this more efficiently (I'd be willing to bet you'll see a noticeable speed difference):

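set(database_records[database_records.active == 'Y']['id'].unique())

Note that this works on database_records, the DataFrame itself, not on database_dicts, which is a plain list of dicts and does not support this kind of indexing.
database_records[database_records.active == 'Y'] filters the DataFrame and keeps only the rows you want.
.unique() returns the unique values (here, those of the id column).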
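
To make the filtering step concrete, here is a minimal sketch on a three-row frame shaped like the sample in the question. The frame is built inline purely for illustration, and the names mask and active_ids are not from the original post.

```python
import pandas as pd

# Tiny stand-in for database_records, using the sample rows from the question.
database_records = pd.DataFrame({
    'active': ['N', 'Y', 'N'],
    'id': ['SSE-AD-3469148', 'OAG-AD-12326299', 'OAG-AD-6834787'],
})

# Boolean mask: True for rows where the ad is active.
mask = database_records['active'] == 'Y'

# Keep the matching rows, take the id column, and build a set for fast lookups.
active_ids = set(database_records.loc[mask, 'id'].unique())
print(active_ids)  # {'OAG-AD-12326299'}
```

Since set() already removes duplicates, the .unique() call is optional here; it just shrinks the data before the set is built.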

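In general, when your data is in a DataFrame, try to do as much of the work as possible with DataFrame operations; that is far more efficient than looping in pure Python.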
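
As one illustration of that advice, the new-ad check from the question could probably be done without a Python loop as well. This is only a sketch under assumptions: adscrape is taken to be a list of dicts that each have an 'id' key, as in the question, while the sample values, the date string, and the names scraped, is_new and new_ids are made up for the example.

```python
import pandas as pd

# Small stand-in for the stored records (sample rows from the question).
database_records = pd.DataFrame({
    'active': ['N', 'Y', 'N'],
    'id': ['SSE-AD-3469148', 'OAG-AD-12326299', 'OAG-AD-6834787'],
})

# Hypothetical scrape result: one ad already stored and one unseen ad.
adscrape = [{'id': 'OAG-AD-12326299'}, {'id': 'NEW-AD-1'}]
date = '01/02/2016'  # assumed scan date, same format as the sample data

# Load the scrape into a DataFrame and set the per-scan columns in one go.
scraped = pd.DataFrame(adscrape)
scraped['last scan'] = date
scraped['active'] = 'Y'

# Vectorized membership test against every stored ID, like the original loop.
is_new = ~scraped['id'].isin(database_records['id'])

# Rows for ads that are not in the database yet.
newads = scraped[is_new].copy()
newads['first scan'] = date
new_ids = newads['id'].tolist()
print(new_ids)  # ['NEW-AD-1']
```

To compare against active ads only, pass the filtered id column to isin instead of the full one. If the rest of the script still expects a list of dicts, newads.to_dict(orient='records') converts back.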

Looping through a Pandas DataFrame to generate a list: the most efficient way
