Fastest way to iterate over two pandas DataFrames

Time: 2016-06-15 10:44:33

Tags: python pandas dataframe

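I have two DataFrames:

dt1, where I store millions of registry keys. Column definition: Index([u'count', u'id', u'is_malicious', u'key', u'name', u'value'], dtype='object')

dt2, where I match those registry keys to the computers on our network. Column definition: Index([u'id', u'machine_id', 'registry_key_id'], dtype='object')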

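What is the fastest way to iterate over all rows in dt1 and, for each row, count how many times row['id'] appears in dt2 in the column row['registry_key_id']?

The pseudocode can be thought of as:

for row in dt1:
    row['count'] = count(dt2[dt2['registry_key_id'] == row['id']])

Right now we are using for index, row in panda.iterrows(): but it is quite slow when processing hundreds of thousands of rows. We are looking for a way to speed this up significantly.

Thank you for your help.

Update 1:

Please see the following code:

count = count.groupby('registry_key_id').count()
res = hunter.registry_keys().copy(deep=True)
res['count'] = res['id'].map(count['id'])
len(count) == len(res)

We have confirmed that count['id'] returns the total number of times each 'registry_key_id' occurs.

All of the values in res['count'] are [value missing in the source].

Can you help us resolve this?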

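Answer:

Using a combination of Flab's and Yarnspinner's responses, I was able to cut the time for counting and mapping in pandas from 1 hour to 12 seconds. Thanks!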
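The poster does not show the final code, so the following is only a rough sketch of what a count-then-map combination might look like on the question's column names. The sample values in dt1 and dt2 are invented for illustration, and value_counts is one possible way to do the counting step, not necessarily the one the poster used.

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (values invented for illustration).
dt1 = pd.DataFrame({"id": [1, 2, 3], "key": ["HKLM\\A", "HKLM\\B", "HKLM\\C"]})
dt2 = pd.DataFrame({"id": [10, 11, 12, 13],
                    "machine_id": [7, 8, 9, 7],
                    "registry_key_id": [1, 1, 2, 1]})

# Step 1: count how often each registry_key_id occurs in dt2 (a single pass).
counts = dt2["registry_key_id"].value_counts()

# Step 2: look up each dt1 id in that Series. Ids absent from dt2 come back
# as NaN, so fill them with 0 and restore an integer dtype.
dt1["count"] = dt1["id"].map(counts).fillna(0).astype(int)

print(dt1["count"].tolist())  # [3, 1, 0]
```

The lookup Series is indexed by registry_key_id, which is what lets map match it against dt1['id'] without any Python-level loop.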

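4 answers:

Answer 0: (score: 3)

You can try .map. After building a DataFrame that holds the count of each distinct id in the second DataFrame, you can map the reference_id of the first DataFrame onto it.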

import time

import pandas as pd

df1 = pd.DataFrame(data={"id": ["a", "b", "c", "d"] * 5, "value": range(20)}, index=range(20))
df2 = pd.DataFrame(data={"id": ["a", "a", "a", "b", "b", "c"] * 10, "whatever": range(60)})

# Work on copies so both timings start from identical data
df1_1 = df1.copy()
df2_1 = df2.copy()

# Baseline: assign a constant value taken from df2, one row at a time
t0 = time.perf_counter()
reference_df2 = df2.groupby("id").count()
for index, row in df1.iterrows():
    df1.loc[index] = (index, reference_df2["whatever"].iloc[1])
t1 = time.perf_counter()
print("Simply assigning constant value from df2 with iterrows method: " + str(t1 - t0))
# print(df1)

# Vectorised: count once per id, then map the counts onto df1
t0 = time.perf_counter()
new_df2 = df2_1.groupby("id").count()
df1_1["id_count"] = df1_1["id"].map(new_df2["whatever"])
t1 = time.perf_counter()
print("map method: " + str(t1 - t0))

map is much faster.

Simply assigning constant value from df2 with iterrows method: 0.0124636374812
map method: 0.00155283320419
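As a sanity check, here is a small sketch (not part of the original answer) that compares the map result with a row-by-row count in the spirit of the question's pseudocode, using the same sample data. The names naive and mapped are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c", "d"] * 5, "value": range(20)})
df2 = pd.DataFrame({"id": ["a", "a", "a", "b", "b", "c"] * 10, "whatever": range(60)})

# Row-by-row count, as in the question's pseudocode (slow on large frames).
naive = [int((df2["id"] == key).sum()) for key in df1["id"]]

# Vectorised count: group once, then look every id up in the result.
mapped = df1["id"].map(df2.groupby("id")["whatever"].count())

# map leaves NaN where an id has no group in df2 ("d" here); the loop gives 0.
assert mapped.fillna(0).astype(int).tolist() == naive
print(int(mapped.isna().sum()))  # 5 rows carry the unmatched id "d"
```

The only difference between the two results is how unmatched ids are reported, which is worth keeping in mind if a 0 is expected downstream.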

Answer 1: (score: 2)

Building on Yarnspinner's answer, I agree that you can split the problem into two steps: count all the ids in df2, then map that information onto df1.

import pandas as pd

df1 = pd.DataFrame(data={"id": ["a", "b", "c", "d"] * 5, "value": range(20)}, index=range(20))
df2 = pd.DataFrame(data={"id": ["a", "a", "a", "b", "b", "c"] * 10, "whatever": range(60)})

# Count the rows per id in df2 and turn the result into a plain dict
count_dict = df2.groupby('id').count().to_dict()['whatever']

# If a key in df1 is not in df2, then assign a 0 count
# This part can probably be optimised but is not the purpose of the question
unique_df1_id = df1['id'].unique().tolist()
for key in unique_df1_id:
    if key not in count_dict:
        count_dict[key] = 0

# Here you create a new column containing the desired output
df1.loc[:, 'id count'] = df1['id'].replace(count_dict)
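The answer notes that the zero-filling loop can probably be optimised. One possible way, sketched here as an assumption and not as the answerer's code, is to reindex the counts onto df1's distinct ids with fill_value=0 and then use map for the lookup.

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["a", "b", "c", "d"] * 5, "value": range(20)})
df2 = pd.DataFrame({"id": ["a", "a", "a", "b", "b", "c"] * 10, "whatever": range(60)})

# Count ids in df2, then reindex onto df1's distinct ids; fill_value=0 covers
# ids that never occur in df2, replacing the explicit Python loop.
counts = df2.groupby("id").size().reindex(df1["id"].unique(), fill_value=0)

# Look each id up in the completed counts.
df1["id count"] = df1["id"].map(counts)

print(counts.to_dict())  # {'a': 30, 'b': 20, 'c': 10, 'd': 0}
```

Because every id in df1 is now present in counts, the mapped column stays an integer column with no NaN to clean up afterwards.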

Answer 2: (score: 0)

I think that if you perform a left merge, you can count the duplicates by calling value_counts on the 'id' column.
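The answer's own code is not preserved here, so this is only a hedged sketch of the left-merge idea on the question's column names. The sample values are invented, and dt2's own 'id' column is left out to avoid suffixed column names after the merge.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
dt1 = pd.DataFrame({"id": [1, 2, 3]})
dt2 = pd.DataFrame({"machine_id": [7, 8, 9, 7], "registry_key_id": [1, 1, 2, 1]})

# A left merge keeps every dt1 row, repeated once per matching dt2 row;
# unmatched dt1 rows survive with NaN in the dt2 columns.
merged = dt1.merge(dt2, how="left", left_on="id", right_on="registry_key_id")

# value_counts on 'id' counts merged rows, so an unmatched id still shows up as 1.
print(merged["id"].value_counts().to_dict())

# Counting non-null right-hand keys per id reports 0 for unmatched ids instead.
per_id = merged.groupby("id")["registry_key_id"].count()
print(per_id.to_dict())
```

Note the caveat in the comments: with a left merge, value_counts alone cannot tell an id with one match from an id with none, so the groupby count on the right-hand key may be the safer variant.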

Answer 3: (score: 0)

Would something like this work for you?

matches = dt2[dt2.registry_key_id.isin(dt1.id)]
count = len(matches)
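For what it is worth, the snippet above yields one overall total and not a count per dt1 row. A small sketch with invented sample values shows the difference; per_key is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample data; key 99 has no counterpart in dt1.
dt1 = pd.DataFrame({"id": [1, 2, 3]})
dt2 = pd.DataFrame({"registry_key_id": [1, 1, 2, 1, 99]})

# isin keeps the dt2 rows whose key exists in dt1; len() is one overall total.
matches = dt2[dt2["registry_key_id"].isin(dt1["id"])]
total = len(matches)

# A per-key breakdown of the same rows; the per-key counts sum to the total.
per_key = matches["registry_key_id"].value_counts()
assert per_key.sum() == total

print(total)  # 4
```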