Optimizing a computation in Python

Asked: 2017-06-08 23:24:49

Tags: python python-3.x pandas optimization

I have a list of 900 elements. I want to combine each element with every other element and process each pair, which gives me a 900×900 matrix. Right now I'm doing this:

import datetime as dt
import multiprocessing as mp
import pickle

import pandas as pd

Col_List = [l1, l2, ..., l900]  # l1, l2, ... are the column names of the dataframe


def Similarity(lists):
    """
    Similarity between item1 and item2.
    Input: a tuple (item i, item j) of column names
    Output: similarity score between the two items (sum of S_ij over the rows)
    """
    item1, item2 = lists
    if item1 == item2:
        return 1
    # Note: the CSV file is re-read on every call.
    dataframe = pd.read_csv('df_Student_User')
    # Keep only rows where both columns are present and differ.
    df_test = dataframe.loc[
        (dataframe[item1] != dataframe[item2])
        & dataframe[item1].notnull()
        & dataframe[item2].notnull(),
        [item1, item2],
    ].copy()
    df_test[item1] = pd.to_datetime(df_test[item1])  # convert to datetime
    df_test[item2] = pd.to_datetime(df_test[item2])
    # Delta_t: absolute gap between the two timestamps, in whole hours
    df_test['Delta_t'] = ((df_test[item1] - df_test[item2])
                          .dt.total_seconds() / 3600).astype(int).abs()
    # Delta_D: whole hours between the earlier timestamp and today (midnight)
    today = pd.Timestamp(dt.date.today())
    df_test['Delta_D'] = ((df_test[[item1, item2]].min(axis=1) - today)
                          .abs().dt.total_seconds() // 3600).astype(int)
    df_test['S_ij'] = 1 / (df_test.Delta_t + df_test.Delta_D)
    return df_test.S_ij.sum()


def main():
    Sim_Dict = {}
    with open('list.pickle', 'wb') as handle:
        pickle.dump(Col_List, handle, protocol=pickle.HIGHEST_PROTOCOL)
    start = dt.datetime.now()
    for i, LO in enumerate(Col_List):
        LO_list = [LO] * len(Col_List)
        # A new pool is created for every row and closed when the row is done.
        with mp.Pool() as pool:
            Sim_Dict[LO] = pool.map(Similarity, zip(LO_list, Col_List))
        if i % 25 == 0:
            print('done:', LO)
    with open('dictionary.pickle2', 'wb') as handle:
        pickle.dump(Sim_Dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
    end = dt.datetime.now()
    print("timeTaken: ", end - start)


if __name__ == '__main__':
    main()

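Can I optimize this further so that the results come back faster? For 100×100 this approach takes about 10 minutes, so by simple arithmetic 900×900 would take about 13 hours.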
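The estimate can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes run time grows in proportion to the number of pairs, and the second figure additionally assumes the score is symmetric, which appears to hold for the Similarity() above since it only uses the absolute difference and the row-wise minimum of the two columns. The variable names are made up for illustration.

```python
# Scale the measured 100x100 run time to a 900x900 matrix.
measured_pairs = 100 * 100      # pairs in the timed run
measured_minutes = 10           # reported wall-clock time for that run
target_pairs = 900 * 900        # pairs in the full matrix

scale = target_pairs / measured_pairs             # 81x more pairs
estimated_hours = measured_minutes * scale / 60   # 13.5 hours

# Assumption: Similarity(a, b) == Similarity(b, a) and the diagonal is 1,
# so only the pairs above the diagonal would need to be computed.
unique_pairs = 900 * 899 // 2
symmetric_hours = estimated_hours * unique_pairs / target_pairs

print(f"full matrix: {estimated_hours:.1f} h")
print(f"upper triangle only: {symmetric_hours:.1f} h")
```

Under those assumptions, computing only the upper triangle would roughly halve the estimate.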

A single call to Similarity() takes about 0.1400 seconds to compute the similarity between two random items.
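Much of that per-call time may come from re-reading the CSV and re-parsing the dates for every pair. Below is a minimal, unbenchmarked sketch of one possible restructuring: load the data once, parse each column once, and compute only the pairs above the diagonal. The function name `similarity_matrix` and its `now` parameter are hypothetical and not part of the original code, and the sketch assumes every column in `columns` can be parsed by `pd.to_datetime`.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def similarity_matrix(dataframe, columns, now=None):
    """Return a symmetric DataFrame of pairwise similarity scores.

    Hypothetical helper: `now` is the reference time for Delta_D and
    defaults to today at midnight.
    """
    now = pd.Timestamp.now().normalize() if now is None else pd.Timestamp(now)
    # Parse every column once instead of once per pair.
    dates = {c: pd.to_datetime(dataframe[c]) for c in columns}
    n = len(columns)
    scores = np.eye(n)  # the score of an item with itself is defined as 1
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        s1, s2 = dates[a], dates[b]
        # Keep rows where both values are present and differ.
        mask = s1.notna() & s2.notna() & (s1 != s2)
        d1, d2 = s1[mask], s2[mask]
        # Delta_t: absolute gap between the two timestamps, in whole hours.
        delta_t = (d1 - d2).abs().dt.total_seconds() // 3600
        # Delta_D: whole hours between the earlier timestamp and `now`.
        earlier = d1.where(d1 < d2, d2)
        delta_d = (earlier - now).abs().dt.total_seconds() // 3600
        # Fill both halves of the matrix from one computation.
        scores[i, j] = scores[j, i] = (1.0 / (delta_t + delta_d)).sum()
    return pd.DataFrame(scores, index=columns, columns=columns)
```

With the names from the question, this could be called as `similarity_matrix(pd.read_csv('df_Student_User'), Col_List)`. The loop is sequential here; whether it still needs multiprocessing after the repeated I/O is removed would have to be measured.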

0 answers:

No answers