How to optimize the code when exporting to CSV?

Time: 2019-09-16 12:38:10

Tags: pandas

I am trying to export rows to a CSV file based on some filtering rules, but the export takes a long time. Can anyone suggest how to optimize the code?

Code snippet:

import pandas as pd

readCsv = pd.read_csv(inputFile)
readCsv.head()
readCsv.columns

# Keep rows whose column contains the key (case-insensitive), then export
readCsv[readCsv[attributeKey.title()].str.casefold().str.contains(Key.casefold())].to_excel(
    r"C:\User\Desktop\resultSet.xlsx", index=None, header=True)
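Note: the title says CSV but the snippet writes an Excel file, and pandas' to_excel is typically much slower than to_csv. A minimal sketch of a CSV export instead, reusing the question's attributeKey and Key variables (the output path here is illustrative, not from the original post):

filtered = readCsv[readCsv[attributeKey.title()].str.casefold().str.contains(Key.casefold())]
filtered.to_csv(r"C:\User\Desktop\resultSet.csv", index=False)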

1 Answer:

Answer 0 (score: 0)

Like this?


Edit 1: Answer 2

import pandas as pd
import time

# Build a test frame: 10,000,000 rows of random 10-character strings
df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

attributeKey = 'column1'
Key = 'abc'  # the string you are checking for

start = time.time()
result = df[df[attributeKey.title()].str.lower().str.contains(Key.lower())]
end = time.time()

print(end - start)

result.to_excel('output.xlsx')

For 10,000,000 rows this takes roughly twice as long as the apply version in Edit 2 below. Anything bigger and I run into memory errors. If you need something larger, you may want to look into Dask.
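For reference, a minimal sketch of the same filter in Dask (my addition, not from the original answer; the file name and key string are illustrative):

import dask.dataframe as dd

# Read the CSV lazily, in partitions that fit in memory
ddf = dd.read_csv('input.csv')
mask = ddf['Column1'].str.lower().str.contains('abc')
filtered = ddf[mask].compute()  # compute() materializes the result as a pandas DataFrame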

Edit 2: Answer 3

Switching to apply cuts the time roughly in half.

import pandas as pd
import time

def check_key(s):
    # Plain Python substring check, avoiding the regex machinery of str.contains
    return KEY.lower() in s.lower()

df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

KEY = 'abc'  # the string you are checking for
ATTRIBUTE_KEY = 'column1'

start = time.time()
df[df[ATTRIBUTE_KEY.title()].apply(check_key)]
end = time.time()

print(end - start)

Output: 6.47
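Another variant worth timing (my addition, not part of the original answer): str.contains accepts regex=False and case=False, which skips the regex engine while staying vectorized:

# Hypothetical alternative: vectorized substring match without regex
df[df[ATTRIBUTE_KEY.title()].str.contains(KEY, case=False, regex=False)]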

Edit 3: Answer 4

Just for fun, I tried multiprocessing:

import pandas as pd
import numpy as np
import time
from multiprocessing import Pool
from functools import partial

def parallelize(data, func, num_of_processes=8):
    # Split the data into chunks and run func over each chunk in its own process
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    return data_subset.apply(func)

def parallelize_on_rows(data, func, num_of_processes=8):
    return parallelize(data, partial(run_on_subset, func), num_of_processes)

def check_key(s):
    return KEY.lower() in s.lower()

df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})

KEY = 'abc'  # the string you are checking for
ATTRIBUTE_KEY = 'column1'

start = time.time()
parallelize_on_rows(df[ATTRIBUTE_KEY.title()], check_key)
end = time.time()

print(end - start)

Output: 3.3952105045318604, so Answer 3 still seems to be the most efficient overall, at least for data of this size.
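One caveat (my addition, not part of the original answer): with the spawn start method used on Windows and macOS, multiprocessing re-imports the main module in each child process, so the driver code above should live under a __main__ guard, roughly like this:

if __name__ == '__main__':
    # Keep DataFrame construction and the pool inside the guard so child
    # processes do not re-execute this block when they import the module
    df = pd.DataFrame({'Column1': pd.util.testing.rands_array(10, 10000000)})
    start = time.time()
    parallelize_on_rows(df[ATTRIBUTE_KEY.title()], check_key)
    end = time.time()
    print(end - start)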