I have a pandas DataFrame holding data read from a CSV file. I want to concatenate several columns. I first hard-coded a simple pandas column concatenation, then refactored it into more general code, but the refactored version takes a severe runtime penalty. Here are the two versions of the concatenation and their timings:
import time

t0 = time.time()
cleaned_data_set1 = data_set.col1.map(str) + " " + data_set.col2.map(str) + " " + data_set.col3.map(str)
t1 = time.time()
print(t1 - t0)
listOfObjectAttributeNames = ["col1", "col2", "col3"]

t0 = time.time()
cleaned_data_set = data_set.apply(lambda x: " ".join([str(el) for el in x[listOfObjectAttributeNames]]), axis=1)
t1 = time.time()
print(t1 - t0)
Here are the respective execution times:
1.20745110512
171.689060926
How can I improve the runtime of the second version?
Answer 0 (score: 0)
I managed to keep the vectorized version and generalize the operation as follows:
t0 = time.time()
listOfObjectAttributeNames = ["col1", "col2", "col3"]
# Seed with the first column, then append the remaining columns with a space
# separator so the result matches the hard-coded version, column by column.
cleaned_data_set = data_set[listOfObjectAttributeNames[0]].map(str)
for i in listOfObjectAttributeNames[1:]:
    cleaned_data_set = cleaned_data_set + " " + data_set[i].map(str)
t1 = time.time()
print(t1 - t0)
Runtime result:
1.04347801208
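As a related sketch (not part of the original answer), the same space-separated concatenation can also be expressed with pandas' Series.str.cat, which accepts a list of other Series plus a sep argument and stays vectorized. This assumes data_set is the DataFrame from the question with the same column names; concat_columns is a hypothetical helper name used only for illustration.

def concat_columns(df, column_names, sep=" "):
    """Join the given columns of df into one separator-delimited string Series."""
    # Convert the first column to strings, then append the rest with str.cat,
    # avoiding the Python-level row loop that apply(axis=1) incurs.
    result = df[column_names[0]].astype(str)
    others = [df[c].astype(str) for c in column_names[1:]]
    return result.str.cat(others, sep=sep) if others else result

# Example usage with the column names from the question:
# cleaned_data_set = concat_columns(data_set, ["col1", "col2", "col3"])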