Writing files with multiprocessing and pandas does not produce correct results

Asked: 2014-05-02 16:19:49

Tags: python file pandas multiprocessing

Overview:

I am trying to use multiprocessing with pandas. The idea is that I have a large file that does not fit in memory, I need to do some data manipulation on it, and I want to produce N output files, where N is the number of cores (8 for me).

What works:

If I don't use multiprocessing, everything works, and the line count of the original file matches the line count of all the output files together:

111013 litle_dd.csv

And the output:

root@xxxx:/tmp/tmpXw_KKV$ wc -l *
   14000 part_0
   14000 part_1
   14000 part_2
   14000 part_3
   14000 part_4
   14000 part_5
   14000 part_6
   13011 part_7
  111011 total

The two-line difference (111013 vs. 111011) is the header plus the one row I skip with skiprows=[1].
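For context, the working single-process flow boils down to something like this sketch (the `sep='|'`, `skiprows=[1]` and round-robin naming are taken from the code below; the compression argument is omitted here, and `split_round_robin` is a name I made up for illustration):

```python
import csv
import os
import tempfile

import pandas as pd


def split_round_robin(path, n_parts=8, chunksize=1000):
    """Read `path` in chunks and append each chunk, round-robin,
    to one of n_parts output files. Single-process baseline."""
    out_dir = tempfile.mkdtemp()
    reader = pd.read_csv(path, sep='|', skipinitialspace=True,
                         quoting=csv.QUOTE_ALL, skiprows=[1],
                         iterator=True, chunksize=chunksize)
    for i, df in enumerate(reader):
        name = os.path.join(out_dir, 'part_%s' % (i % n_parts))
        # mode='a' because several chunks land in the same part file
        df.to_csv(name, index=False, quoting=csv.QUOTE_ALL,
                  mode='a', header=False)
    return out_dir
```

Run sequentially like this, the sum of the part files' line counts equals the number of data rows read, which is the behavior shown above.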

What doesn't work:

root@xxxxxx:/tmp/tmpd7GcYP$ wc -l *
   112000 part_0
   112000 part_1
   112000 part_2
   112000 part_3
   112000 part_4
   112000 part_5
   112000 part_6
   104088 part_7
   888088 total

As you can see, there are far more lines - exactly 8 times as many (I think it depends on the number of cores - 8).

Code:

import sys                                                                      
import os                                                                       
import multiprocessing as mp                                                    
import Queue                                                                    
import subprocess                                                               
import tempfile                                                                 
import csv                                                                      
import logging                                                                  

import pandas as pd                                                             


logging.basicConfig(level=logging.DEBUG,                                        
                    format='%(asctime)s - %(name)s - %(message)s')              
log = logging.getLogger('converter')                                            

if len(sys.argv) != 3:                                                          
    raise Exception("Need more arguments")                                      


queue = Queue.Queue()                                                           
pool = mp.Pool(mp.cpu_count())                                                  

file_path = sys.argv[1]                                                         
enc = sys.argv[2]                                                               

temp_folder = tempfile.mkdtemp()                                                
workers = []                                                                    

def write_to_file():                                                            
    while True:                                                                 
        try:                                                                    
            df, name = queue.get(False)                                         
            df.to_csv(name, index_label=False, quoting=csv.QUOTE_ALL, mode='a', 
                      header=False)                                             
            queue.task_done()                                                   
            log.debug(queue.qsize())                                            
        except Queue.Empty:                                                     
            log.info('No more items for this process')                          
            break                                                               
        except Exception:                                                       
            log.error("Error",exc_info=True)                                    
            break                                                               

def finished(name):                                                             
    return "Data was written to %s" % name                                      

if __name__ == "__main__":                                                      
    texter = pd.read_csv(file_path, sep='|', skipinitialspace=True,             
                         quoting=csv.QUOTE_ALL, compression=enc, skiprows=[1],  
                         iterator=True, chunksize=1000)                         

    for i, df in enumerate(texter):                                             
        part_nm = i % mp.cpu_count()                                            
        name = os.path.join(temp_folder, 'part_%s' %  part_nm)                  
        queue.put((df, name))                                                   

    write_to_file()                                                             

    #for i in xrange(mp.cpu_count()):                                           
    #    p = mp.Process(target=write_to_file)                                   
    #    p.daemon = True                                                        
    #    workers.append(p)                                                      

    #for i in workers:                                                          
    #    i.start()                                                              

    #for i in workers:                                                          
    #    i.join()                                                               

    #log.info('The End')

0 Answers:

There are no answers yet.