如何避免两个for
循环并优化我的代码才能处理大数据?
import pandas as pd
import numpy as np
array = np.array([[1,'aaa','bbb'],[2,'ccc','bbb'],[3,'zzzz','bbb'],[4,'eee','zzzz'],[5,'ccc','bbb'],[6,'zzzz','bbb'],[7,'aaa','bbb']])
df= pd.DataFrame(array)
l=[]
for i in range(len(df)):
for j in range(i+1,len(df)):
if (df.loc[i][1] == df.loc[j][1]) & (df.loc[i][2] == df.loc[j][2]):
l.append((df.loc[i][0],df.loc[j][0]))
答案 0 :(得分:0)
您可以按列[1,2]
进行分组,然后汇总列0
中的值,如下所示:
In [91]: df.groupby([1,2])[0] \
.agg(lambda x: tuple(x.values) if len(x)>1 else np.nan) \
.dropna() \
.tolist()
Out[91]: [('1', '7'), ('2', '5'), ('3', '6')]
答案 1 :(得分:0)
按第二和第三列分组。然后使用组合函数:chain
和combinations
。
from itertools import combinations
list(chain(*df.groupby(by=[1, 2])[0].apply(lambda x: combinations(x, 2))))
[('1', '7'), ('2', '5'), ('3', '6')]
稍微更改数据集。
array = np.array([[1,'aaa','bbb'],[2,'ccc','bbb'],[3,'zzzz','bbb'],
[4,'eee','zzzz'],[5,'ccc','bbb'],[6,'zzzz','bbb'],
[7,'aaa','bbb'], [8, "aaa", "bbb"], [9, 'zzzz','bbb']])
df = pd.DataFrame(array)
list(chain(*df.groupby(by=[1, 2])[0].apply(lambda x: combinations(x, 2))))
[('1', '7'),
('1', '8'),
('7', '8'),
('2', '5'),
('3', '6'),
('3', '9'),
('6', '9')]
list(chain(*df.groupby(by=[1, 2])[0].apply(lambda x: combinations(x, 2))))
1.67 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)