EOF error with multiprocessing on a CSV file in Python

Asked: 2013-10-27 00:25:22

Tags: python csv concurrency multiprocessing

I am trying to implement a multiprocessing approach for reading and comparing two csv files. To get started, I began with the code example from embarassingly parallel problems, which sums the integers in a file. The problem is that the example will not run for me. (I am running Python 2.6 on Windows.)

I get the following EOF error:

File "C:\Python26\lib\pickle.py", line 880, in load_eof
raise EOFError
EOFError

On this line:

self.pin.start()

I found some examples suggesting that the problem might be the csv open call needing 'rb'. I tried that, but it did not work either.

I then tried to simplify the code to reproduce the error at the most basic level. I get the same error on the same line, even when I strip the parse_input_csv function down so that it does not read the file at all. (Not sure how an EOF could be triggered if the file is never read?)

import csv
import multiprocessing

class CSVWorker(object):
    def __init__(self, infile, outfile):
        #self.infile = open(infile)
        self.infile = open(infile, 'rb') #try rb for Windows

        self.in_csvfile = csv.reader(self.infile)
        self.inq = multiprocessing.Queue()    
        self.pin = multiprocessing.Process(target=self.parse_input_csv, args=())

        self.pin.start()
        self.pin.join()    
        self.infile.close()

    def parse_input_csv(self):
#         for i, row in enumerate(self.in_csvfile):
#             self.inq.put( (i, row) )

#         for row in self.in_csvfile:
#             print row
#             #self.inq.put( row )

        print 'yup'


if __name__ == '__main__':        
    c = CSVWorker('random_ints.csv', 'random_ints_sums.csv')
    print 'done' 

Finally, I tried pulling it all outside of the class. This works if I do not iterate over the csv, but gives the same error if I do.

def manualCSVworker(infile, outfile):
    f = open(infile, 'rb')
    in_csvfile = csv.reader(f)        
    inq = multiprocessing.Queue()

    # this works (no reading csv file)
    pin = multiprocessing.Process(target=print_yup, args=())

    # this does not work (tries to read csv, and fails with EOFError)
    #pin = multiprocessing.Process(target=manual_parse_input_csv, args=(in_csvfile,))

    pin.start()
    pin.join()    
    f.close()

def print_yup():
    print 'yup'

def manual_parse_input_csv(csvReader):    
    for row in csvReader:
        print row

if __name__ == '__main__':        
    manualCSVworker('random_ints.csv', 'random_ints_sums.csv')
    print 'done' 

Can someone help me figure out the problem here?

EDIT: I thought I would post the working code. I ended up dropping the class implementation. As Tim Peters suggested, I now pass only file names (not open files).

On 5 million rows x 2 columns, I saw roughly a 20% improvement in time with 2 processors versus 1. I expected more, but I think the extra overhead of queueing each row is the problem. Based on the linked discussion, an improvement would probably be to queue the records in blocks of 100 or more rather than one row at a time; a rough sketch of that idea follows the code listing below.

import csv
import multiprocessing
from datetime import datetime

NUM_PROCS = multiprocessing.cpu_count()

def main(numprocsrequested, infile, outfile):

    inq = multiprocessing.Queue()
    outq = multiprocessing.Queue()

    numprocs = min(numprocsrequested, NUM_PROCS)

    pin = multiprocessing.Process(target=parse_input_csv, args=(infile,numprocs,inq,))
    pout = multiprocessing.Process(target=write_output_csv, args=(outfile,numprocs,outq,))
    ps = [ multiprocessing.Process(target=sum_row, args=(inq,outq,)) for i in range(numprocs)]

    pin.start()
    pout.start()
    for p in ps:
        p.start()

    pin.join()
    i = 0
    for p in ps:
        p.join()
        #print "Done", i
        i += 1
    pout.join()

def parse_input_csv(infile, numprocs, inq):
        """Parses the input CSV and yields tuples with the index of the row
        as the first element, and the integers of the row as the second
        element.

        The index is zero-index based.

        The data is then sent over inqueue for the workers to do their
        thing.  At the end the input thread sends a 'STOP' message for each
        worker.
        """
        f = open(infile, 'rb')
        in_csvfile = csv.reader(f)

        for i, row in enumerate(in_csvfile):
            row = [ int(entry) for entry in row ]
            inq.put( (i,row) )

        for i in range(numprocs):
            inq.put("STOP")

        f.close()

def sum_row(inq, outq):
    """
    Workers. Consume inq and produce answers on outq
    """
    tot = 0
    for i, row in iter(inq.get, "STOP"):
        outq.put( (i, sum(row)) )
    outq.put("STOP")

def write_output_csv(outfile, numprocs, outq):
    """
    Open outgoing csv file then start reading outq for answers
    Since I chose to make sure output was synchronized to the input there
    is some extra goodies to do that.

    Obviously your input has the original row number so this is not
    required.
    """

    cur = 0
    stop = 0
    buffer = {}
    # For some reason csv.writer works badly across threads so open/close
    # and use it all in the same thread or else you'll have the last
    # several rows missing
    f = open(outfile, 'wb')
    out_csvfile = csv.writer(f)

    #Keep running until we see numprocs STOP messages
    for works in range(numprocs):
        for i, val in iter(outq.get, "STOP"):
            # verify rows are in order, if not save in buffer
            if i != cur:
                buffer[i] = val
            else:
                #if yes are write it out and make sure no waiting rows exist
                out_csvfile.writerow( [i, val] )
                cur += 1
                while cur in buffer:
                    out_csvfile.writerow([ cur, buffer[cur] ])
                    del buffer[cur]
                    cur += 1
    f.close()

if __name__ == '__main__':

    startTime = datetime.now()
    main(4, 'random_ints.csv', 'random_ints_sums.csv')
    print 'done'
    print(datetime.now()-startTime)
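
As a rough sketch of the chunking idea mentioned above (hypothetical and unbenchmarked; it assumes a block size of 100 and shows only the producer and worker functions), parse_input_csv could batch rows before putting them on the queue, and the workers would then consume one block per get:

import csv

CHUNK_SIZE = 100  # assumed block size; tune for the workload

def parse_input_csv_chunked(infile, numprocs, inq):
    """Producer: queue rows in blocks of CHUNK_SIZE instead of one at a time."""
    f = open(infile, 'rb')
    reader = csv.reader(f)
    chunk = []
    for i, row in enumerate(reader):
        chunk.append((i, [int(entry) for entry in row]))
        if len(chunk) >= CHUNK_SIZE:
            inq.put(chunk)
            chunk = []
    if chunk:                      # flush the final partial block
        inq.put(chunk)
    for _ in range(numprocs):      # one STOP sentinel per worker
        inq.put("STOP")
    f.close()

def sum_rows_chunked(inq, outq):
    """Worker: consume a block of (index, row) pairs per get, emit a block of sums."""
    for chunk in iter(inq.get, "STOP"):
        outq.put([(i, sum(row)) for i, row in chunk])
    outq.put("STOP")

write_output_csv would then unpack each block it receives, but the ordering logic stays the same.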

1 Answer:

Answer 0 (score: 4):

Passing an object between processes requires "pickling" it on the sending end (creating a string representation of the object) and "unpickling" it on the receiving end (recreating an isomorphic object from that string representation). Unless you know exactly what you are doing, you should stick to passing built-in Python types (strings, ints, floats, lists, dicts, ...) or types implemented by multiprocessing (Lock(), Queue(), ...). Otherwise chances are good that the pickling will not work.

There is no chance that passing an open file will ever work, let alone an open file wrapped inside another object (such as the one returned by csv.reader(f)). When I ran your code, I got an error message from pickle:

pickle.PicklingError: Can't pickle <type '_csv.reader'>: it's not the same object as _csv.reader

Didn't you? Never ignore errors - unless, again, you know exactly what you are doing.

The solution is simple: as I said in a comment, open the file in the worker process, passing just its string path. For example, use this instead:

def manual_parse_input_csv(csvfile):
    f = open(csvfile, 'rb')
    in_csvfile = csv.reader(f)
    for row in in_csvfile:
        print row
    f.close()

and take all that code out of manualCSVworker, changing the process-creation line to:

pin = multiprocessing.Process(target=manual_parse_input_csv, args=(infile,))

See? It passes the file path, a plain string. That works :-)
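
For a quick way to see the pickling constraint in isolation, here is a minimal sketch (it assumes a random_ints.csv exists in the working directory) that tries the pickling by hand:

import csv
import pickle

# Built-in types pickle without trouble:
pickle.dumps((1, 2.5, ['a', 'list'], {'a': 'dict'}))

# An open file, and the csv.reader wrapped around it, do not.
# This is essentially what multiprocessing attempts behind the scenes
# on Windows when it spawns the worker process.
f = open('random_ints.csv', 'rb')
reader = csv.reader(f)
try:
    pickle.dumps(reader)
except Exception as e:
    print 'cannot pickle the reader:', e
f.close()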