I have two dataframes:
dt1, in which I store several million registry keys, with columns: Index([u'count', u'id', u'is_malicious', u'key', u'name', u'value'], dtype='object')

dt2, which matches those registry keys to the machines on our network, with columns: Index([u'id', u'machine_id', u'registry_key_id'], dtype='object')
What is the fastest way to iterate over all of the rows in dt1 and, for each row, count the number of times that row['id'] appears in dt2's 'registry_key_id' column?

The pseudocode for what we are doing now is:

for row in dt1:
    row['count'] = count(dt2[dt2['registry_key_id'] == row['id']])

but it is quite slow when processing hundreds of thousands of rows. We are looking for a way to speed this up significantly.
Thank you for your help.
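(For context, a minimal runnable version of the per-row pattern described above, using tiny made-up frames that only borrow the question's column names:)

```python
import pandas as pd

# Tiny stand-ins for the question's frames (the real ones have millions of rows).
dt1 = pd.DataFrame({"id": [1, 2, 3], "key": ["k1", "k2", "k3"]})
dt2 = pd.DataFrame({"registry_key_id": [1, 1, 3, 3, 3]})

# The slow pattern: one full boolean scan of dt2 per row of dt1,
# i.e. O(len(dt1) * len(dt2)) comparisons overall.
counts = []
for _, row in dt1.iterrows():
    counts.append(int((dt2["registry_key_id"] == row["id"]).sum()))
dt1["count"] = counts
print(dt1["count"].tolist())  # [2, 0, 3]
```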
Update 1:

Please see the following code, which replaced our original for index, row in panda.iterrows(): loop:

count = count.groupby('registry_key_id').count()
res = hunter.registry_keys().copy(deep=True)
res['count'] = res['id'].map(count['id'])

We have determined that count['id'] returns the total number of times that each 'registry_key_id' was found, and that len(count) == len(res). But all of the values in res['count'] are NaN.

Can you help solve this?
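(A possible cause, not confirmed by the question: Series.map returns NaN for every id that is absent from the grouped index, and when every value comes back NaN the ids usually fail to match the index at all, e.g. an int/str dtype mismatch. A sketch of the same pattern on toy data, with fillna(0) covering ids that have no matches:)

```python
import pandas as pd

# Toy stand-ins for the question's frames.
dt1 = pd.DataFrame({"id": [1, 2, 3]})
dt2 = pd.DataFrame({"id": [10, 11, 12, 13, 14],
                    "registry_key_id": [1, 1, 3, 3, 3]})

# groupby().count() indexes the result by registry_key_id ...
count = dt2.groupby("registry_key_id").count()

# ... and map() looks each dt1 id up in that index. Ids with no
# match come back as NaN, so fillna(0) restores a zero count.
res = dt1.copy(deep=True)
res["count"] = res["id"].map(count["id"]).fillna(0).astype(int)
print(res["count"].tolist())  # [2, 0, 3]
```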
Answer:

Using a combination of Flab's and Yarnspinner's answers, I was able to reduce the time to count and map the pandas data from 1 hour down to 12 seconds. Thank you!
Answer 0 (score: 3)
You can try .map. After creating a dataframe that holds the count of each distinct id in the second dataframe, you can map the ids of the first dataframe onto it.
import pandas as pd
import string
import time
df1=pd.DataFrame(data= {"id": ["a","b","c","d"]*5,"value":range(20)}, index = range(20))
df2=pd.DataFrame(data= {"id": ["a","a","a","b","b","c"]*10,"whatever" : range(60)})
df1_1 = df1.copy()
df2_1 = df2.copy()
t0 = time.clock()
reference_df2 = df2.groupby("id").count()
for index,row in df1.iterrows():
    df1.loc[index] = (index,reference_df2["whatever"][1])
t1 = time.clock()
print "Simply assigning constant value from df2 with iterrows method: " + str(t1-t0)
# print df1
t0 = time.clock()
new_df2 = df2_1.groupby("id").count()
df1_1["id_count"] = df1_1["id"].map(new_df2["whatever"])
t1 = time.clock()
print "map method: " + str(t1-t0)
The map approach is very fast.
Simply assigning constant value from df2 with iterrows method: 0.0124636374812
map method: 0.00155283320419
Answer 1 (score: 2)
Starting from Yarnspinner's answer, I agree that you can split the problem into two steps: count all of the ids in df2, then map that information onto df1.
import pandas as pd
import string
df1=pd.DataFrame(data= {"id": ["a","b","c","d"]*5,"value":range(20)}, index = range(20))
df2=pd.DataFrame(data= {"id": ["a","a","a","b","b","c"]*10,"whatever" : range(60)})
count_dict = df2.groupby('id').count().to_dict()['whatever']
# If a key in df1 is not in df2, then assign a 0 count
# This part can probably be optimised but is not the purpose of the question
unique_df1_id = df1['id'].unique().tolist()
for key in unique_df1_id:
    if key not in count_dict:
        count_dict[key] = 0
#Here you create a new column containing the desired output
df1.loc[:, 'id count'] = df1['id'].replace(count_dict)
Answer 2 (score: 0)
I think that if you perform a left merge, you can then count the dupes by calling value_counts on the 'id' column:
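(This answer gives no code; one reading of it, sketched on invented toy frames, might be:)

```python
import pandas as pd

dt1 = pd.DataFrame({"id": ["a", "b", "c"]})
dt2 = pd.DataFrame({"registry_key_id": ["a", "a", "c", "c", "c"]})

# Left-merge dt2 onto dt1; value_counts then tells us how many
# partner rows each id picked up (unmatched ids merge in as NaN,
# which value_counts ignores).
merged = dt1.merge(dt2, left_on="id", right_on="registry_key_id", how="left")
dupes = merged["registry_key_id"].value_counts()

dt1["count"] = dt1["id"].map(dupes).fillna(0).astype(int)
print(dt1["count"].tolist())  # [2, 0, 3]
```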
Answer 3 (score: 0)
Would something like this work for you?
matches = dt2[dt2.registry_key_id.isin(dt1.id)]
count = len(matches)
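(Worth noting, as my observation rather than part of the answer: this yields the total number of matching rows in dt2, not the per-id count the question asks for; a per-id count still needs a groupby over the matches. A sketch on toy data:)

```python
import pandas as pd

dt1 = pd.DataFrame({"id": ["a", "b", "c"]})
dt2 = pd.DataFrame({"registry_key_id": ["a", "a", "c", "c", "c"]})

# isin() filters dt2 down to rows whose key exists in dt1 ...
matches = dt2[dt2.registry_key_id.isin(dt1.id)]
print(len(matches))  # 5 -> one total for all ids combined

# ... so a per-id count still needs a groupby over those rows.
per_id = matches.groupby("registry_key_id").size()
print(per_id.to_dict())  # {'a': 2, 'c': 3}
```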