I'm trying to use multiprocessing with pandas. The idea is that I have a large file that does not fit into memory; I need to do some data manipulation on it and produce N output files, where N is the number of cores (8 in my case).
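In single-process form the plan is simple. Here is a minimal sketch of the intended round-robin split (the file names here are hypothetical, just to illustrate; the real script is further down):

import multiprocessing as mp
import pandas as pd

N = mp.cpu_count()  # one output file per core

# Read the CSV lazily in chunks so the whole file never sits in memory,
# then append each chunk to one of N part files, round-robin.
reader = pd.read_csv('big_file.csv', iterator=True, chunksize=1000)
for i, chunk in enumerate(reader):
    chunk.to_csv('part_%d' % (i % N), mode='a', header=False)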
If I don't use multiprocessing, everything works fine, and the line count of the original file agrees with the combined count of all the output files. The original file:
111013 litle_dd.csv
And the output:
root@xxxx:/tmp/tmpXw_KKV$ wc -l *
14000 part_0
14000 part_1
14000 part_2
14000 part_3
14000 part_4
14000 part_5
14000 part_6
13011 part_7
111011 total
The two-line difference is just the header plus the one row I skip with skiprows=[1]. But when I run the multiprocessing version (the commented-out worker code at the bottom of the script below), the counts blow up:
root@xxxxxx:/tmp/tmpd7GcYP$ wc -l *
112000 part_0
112000 part_1
112000 part_2
112000 part_3
112000 part_4
112000 part_5
112000 part_6
104088 part_7
888088 total
As you can see, there are far more lines, and I think the factor is the number of cores: 888088 is exactly 8 × 111011, as if every chunk were written once per core.
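My working theory (an assumption on my part, I haven't confirmed it anywhere): Queue.Queue is a thread queue, not a process queue, so when the workers are forked, each child inherits its own full copy of the already-filled queue and happily drains all of it. A minimal demonstration of that effect, separate from my script (assumes a fork-based start, i.e. Linux, which is what I'm running on):

import multiprocessing as mp
import Queue

q = Queue.Queue()   # thread queue, not process-aware
for i in range(3):
    q.put(i)

def drain(tag):
    # Each forked child inherits a private copy of q and drains all of it.
    while True:
        try:
            item = q.get(False)
        except Queue.Empty:
            break
        print tag, item

if __name__ == "__main__":
    workers = [mp.Process(target=drain, args=('worker_%d' % n,))
               for n in range(2)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    # Prints six lines, not three: both workers see all three items.

The full script: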
import sys
import os
import multiprocessing as mp
import Queue
import subprocess
import tempfile
import csv
import logging
import pandas as pd
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(message)s')
log = logging.getLogger('converter')
if len(sys.argv) != 3:
    raise Exception("Need more arguments")
queue = Queue.Queue()           # thread queue holding (chunk, filename) pairs
pool = mp.Pool(mp.cpu_count())  # note: created but never actually used below
file_path = sys.argv[1]
enc = sys.argv[2]
temp_folder = tempfile.mkdtemp()
workers = []
def write_to_file():
    # Drain the queue, appending each DataFrame chunk to its part file.
    while True:
        try:
            df, name = queue.get(False)
            df.to_csv(name, index_label=False, quoting=csv.QUOTE_ALL, mode='a',
                      header=False)
            queue.task_done()
            log.debug(queue.qsize())
        except Queue.Empty:
            log.info('No more items for this process')
            break
        except Exception:
            log.error("Error", exc_info=True)
            break
def finished(name):
    return "Data was written to %s" % name
if __name__ == "__main__":
    texter = pd.read_csv(file_path, sep='|', skipinitialspace=True,
                         quoting=csv.QUOTE_ALL, compression=enc, skiprows=[1],
                         iterator=True, chunksize=1000)
    for i, df in enumerate(texter):
        # Round-robin the chunks over one part file per core.
        part_nm = i % mp.cpu_count()
        name = os.path.join(temp_folder, 'part_%s' % part_nm)
        queue.put((df, name))
    # Single-process version: drain the queue in the main process.
    write_to_file()
    # Multiprocessing version, swapped in for the call above
    # (this is the run that produces the inflated counts):
    #for i in xrange(mp.cpu_count()):
    #    p = mp.Process(target=write_to_file)
    #    p.daemon = True
    #    workers.append(p)
    #for i in workers:
    #    i.start()
    #for i in workers:
    #    i.join()
    #log.info('The End')
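For reference, this is what I think a process-safe version would have to look like. It is an untested sketch (the input file name is a placeholder): it swaps Queue.Queue for multiprocessing.JoinableQueue, which is designed to be shared across processes, and gives each worker exactly one output file so that two processes never append to the same part. Note this changes the distribution: chunks go to whichever worker is free rather than strictly round-robin by index, but the total line count should be preserved.

import os
import csv
import tempfile
import multiprocessing as mp
import pandas as pd

def writer(worker_id, queue, folder):
    # Each worker owns one part file, so appends never interleave.
    name = os.path.join(folder, 'part_%d' % worker_id)
    while True:
        df = queue.get()
        if df is None:          # sentinel: no more chunks
            queue.task_done()
            break
        df.to_csv(name, quoting=csv.QUOTE_ALL, mode='a', header=False)
        queue.task_done()

if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    n = mp.cpu_count()
    queue = mp.JoinableQueue(maxsize=2 * n)   # bounded, keeps memory flat

    workers = [mp.Process(target=writer, args=(i, queue, folder))
               for i in xrange(n)]
    for p in workers:
        p.start()

    reader = pd.read_csv('big_file.csv', sep='|', iterator=True,
                         chunksize=1000)
    for df in reader:
        queue.put(df)           # any idle worker picks the chunk up
    for _ in workers:
        queue.put(None)         # one sentinel per worker

    queue.join()                # wait until every put has been task_done'd
    for p in workers:
        p.join()

Is that the right direction, or is there a simpler way to keep the original round-robin assignment with multiple processes?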