Performance drop when switching from pandas column concatenation to apply on a DataFrame

Time: 2015-08-06 07:56:08

Tags: python pandas

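I have a pandas DataFrame that holds data loaded from a CSV file, and I want to concatenate several of its columns. I first hard-coded this with plain pandas column concatenation, then refactored it into more general code, but the refactored version pays a severe runtime penalty. Here are the two versions of the concatenation and their timings: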

    import time

    # Version 1: hard-coded, column-wise (vectorized) concatenation
    t0 = time.time()
    cleaned_data_set1 = data_set.col1.map(str) + " " + data_set.col2.map(str) + " " + data_set.col3.map(str)
    t1 = time.time()
    print(t1 - t0)

    # Version 2: generalized, but calls a Python lambda once per row
    listOfObjectAttributeNames = ["col1", "col2", "col3"]
    t0 = time.time()
    cleaned_data_set = data_set.apply(lambda x: " ".join([str(el) for el in x[listOfObjectAttributeNames]]), axis=1)
    t1 = time.time()
    print(t1 - t0)

Here are the respective execution times, in seconds:

1.20745110512
171.689060926

How can I improve the running time of the second version?

1 answer:

Answer 0: (score: 0)

I managed to keep the vectorized version while generalizing the operation, as follows:

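        import time

        t0 = time.time()

        listOfObjectAttributeNames = ["col1", "col2", "col3"]

        # Start from an empty string; "" + Series concatenates element-wise.
        # Note: unlike the question's version, no " " separator is inserted between columns.
        cleaned_data_set = ""
        for i in listOfObjectAttributeNames:
            cleaned_data_set = cleaned_data_set + data_set[i].map(str)

        t1 = time.time()

        print(t1 - t0)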

Runtime result:

 1.04347801208
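
The loop above joins the columns without the `" "` separator used in the question. The following is a minimal sketch of one way to keep the separator while staying column-wise. The helper name `concat_columns` and the small sample frame are assumptions made for illustration, they are not part of the original post, and this variant was not timed against the figures above.

```python
import pandas as pd


def concat_columns(frame, columns, sep=" "):
    """Join the given columns row by row into one string Series.

    Hypothetical helper for illustration, not from the original answer.
    """
    # Convert each column to strings once, one whole column at a time.
    parts = [frame[name].map(str) for name in columns]
    result = parts[0]
    for part in parts[1:]:
        # Series + str + Series concatenates strings element-wise.
        result = result + sep + part
    return result


# Small made-up frame standing in for the CSV data.
data_set = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"], "col3": [3.5, 4.5]})
listOfObjectAttributeNames = ["col1", "col2", "col3"]

cleaned_data_set = concat_columns(data_set, listOfObjectAttributeNames)
print(cleaned_data_set.tolist())
```

The work is still done one column at a time rather than one row at a time, which is the property that separates the fast version from the `apply(..., axis=1)` version in the question. pandas also provides `Series.str.cat` with a `sep` argument, which may be worth trying as an alternative.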