I have a list of 900 elements, and I want to pair every element with every other element and process each pair, which gives me a 900x900 matrix. This is what I'm doing now:
import datetime
import pickle
import multiprocessing as mp

Col_List = [l1, l2, ..., l900]  # l1, l2, ... are column names of a dataframe
Sim_Dict = {}

with open('list.pickle', 'wb') as handle:
    pickle.dump(Col_List, handle, protocol=pickle.HIGHEST_PROTOCOL)

start = datetime.datetime.now()
pool = mp.Pool()  # create the pool once; a new Pool per iteration leaks worker processes
for i, LO in enumerate(Col_List):
    LO_list = [LO] * len(Col_List)  # pair LO with every column
    Sim_Dict[LO] = pool.map(Similarity, zip(LO_list, Col_List))
    if i % 25 == 0:
        print('done:', LO)
        # checkpoint the partial result every 25 rows
        with open('dictionary.pickle2', 'wb') as handle:
            pickle.dump(Sim_Dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
pool.close()
pool.join()
end = datetime.datetime.now()
print("timeTaken:", end - start)
import pandas as pd
import datetime as dt

def Similarity(lists):
    """
    Similarity between item1 and item2
    Input: a pair (item i, item j) of column names
    Output: average similarity score between the two items
    """
    item1, item2 = lists
    if item1 == item2:
        return 1
    # NOTE: this re-reads the CSV on every single call,
    # which dominates the runtime
    dataframe = pd.read_csv('df_Student_User')
    df_test = dataframe[(dataframe.loc[:, item1] != dataframe.loc[:, item2])
                        & (dataframe.loc[:, item1].notnull())
                        & (dataframe.loc[:, item2].notnull())
                        ].loc[:, [item1, item2]]
    df_test[item1] = pd.to_datetime(df_test[item1])  # converting to datetime
    df_test[item2] = pd.to_datetime(df_test[item2])
    # Delta_t: absolute gap between the two columns, in whole hours
    df_test['Delta_t'] = abs(((df_test.loc[:, item1] -
                               df_test.loc[:, item2]
                               ).dt.total_seconds() / 3600).astype(int))
    # Delta_D: hours between the earlier of the two dates and today
    df_test['Delta_D'] = abs(df_test[[item1, item2]].min(axis=1) -
                             dt.datetime.now().date()
                             ).astype('timedelta64[h]').astype('int')
    df_test['S_ij'] = 1 / (df_test.Delta_t + df_test.Delta_D)
    SimScore = df_test.S_ij.sum(axis=0)
    del df_test
    return SimScore
Can I optimize this further so it returns results faster? For a 100x100 matrix this approach takes about 10 minutes, so by simple arithmetic 900x900 would take about 13 hours.
Similarity() takes roughly 0.14 seconds to compute the similarity between two random items.
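Two cheap wins worth noting: the matrix is symmetric (Similarity(i, j) == Similarity(j, i)), so only the upper triangle needs computing, i.e. 900*901/2 ≈ 405k pairs instead of 810k, and the data should be loaded once up front. A sketch with a hypothetical toy DataFrame and a simplified similarity (inverse hour-gap, not the exact Delta_t/Delta_D metric above):

```python
import itertools

import pandas as pd

# Hypothetical small stand-in for the real data, loaded ONCE
df = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'b': pd.to_datetime(['2020-01-01', '2020-01-03']),
    'c': pd.to_datetime(['2020-02-01', '2020-02-02']),
})

def similarity(df, item1, item2):
    """Toy similarity: sum of inverse hour-gaps between two date columns."""
    if item1 == item2:
        return 1.0
    delta_t = (df[item1] - df[item2]).abs().dt.total_seconds() / 3600
    delta_t = delta_t[delta_t > 0]  # drop rows where the timestamps match
    return float((1 / delta_t).sum())

cols = list(df.columns)
sim = {}
# Symmetry: compute the upper triangle (plus the diagonal) only,
# then mirror each score to the transposed key.
for i, j in itertools.combinations_with_replacement(cols, 2):
    s = similarity(df, i, j)
    sim[(i, j)] = s
    sim[(j, i)] = s
```

Combined with a pool that is created once, halving the pair count this way roughly halves the wall-clock time on top of whatever the I/O fix saves.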