File processor using multiprocessing

Time: 2012-09-21 10:43:50

Tags: python multiprocessing python-2.6

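I am writing a file processor that can (hopefully) parse arbitrary files and perform arbitrary actions on the parsed contents. The file processor needs to run continuously. The basic idea I am following is:

  1. Every file has two associated processes (one for reading, the other for parsing and writing somewhere else).
  2. The reader reads lines into a common buffer (say, a Queue) until EOF or until the buffer is full, then waits (sleeps).
  3. The writer reads from the buffer, parses the contents, and writes them to (say) a DB until the buffer is empty, then waits (sleeps).
  4. Interrupting the main program causes the reader and writer to exit safely (the buffer can be discarded without being written).

The program runs fine. However, sometimes the writer initializes first and finds the buffer empty, so it goes to sleep. The reader then fills the buffer and goes to sleep as well. So for sleep_interval my code does nothing. To get around this, I tried using multiprocessing.Event() to signal the writer that the buffer has some entries it can process.

My code is: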

    import multiprocessing
    import time
    import sys
    import signal
    import Queue
    
    class FReader(multiprocessing.Process): 
        """
        A basic file reader class
        It spawns a new process that shares a queue with the writer process
        """
        def __init__(self,queue,fp,sleep_interval,read_offset,event): 
            self.queue = queue
            self.fp = fp
            self.sleep_interval = sleep_interval
            self.offset = read_offset
            self.fp.seek(self.offset)
            self.event = event
            self.event.clear()
            super(FReader,self).__init__()
    
        def myhandler(self,signum,frame): 
            self.fp.close()
            print "Stopping Reader"
            sys.exit(0)
    
        def run(self): 
            signal.signal(signal.SIGINT,self.myhandler)
            signal.signal(signal.SIGCLD,signal.SIG_DFL)
            signal.signal(signal.SIGILL,self.myhandler)
            while True: 
                sleep_now = False
                if not self.queue.full(): 
                    print "READER:Reading"
                    m = self.fp.readline()
                    if not self.event.is_set(): 
                        self.event.set()
                    if m: 
                        self.queue.put((m,self.fp.tell()),block=False)
                    else: 
                        sleep_now = True 
                else: 
                    print "Queue Full"
                    sleep_now = True
    
                if sleep_now: 
                    print "Reader sleeping for %d seconds"%self.sleep_interval
                    time.sleep(self.sleep_interval)            
    
    class FWriter(multiprocessing.Process): 
        """
        A basic file writer class
        It spawns a new process that shares a queue with the reader process
        """
        def __init__(self,queue,session,sleep_interval,fp,event): 
            self.queue = queue
            self.session = session
            self.sleep_interval = sleep_interval
            self.offset = 0
            self.queue_offset = 0
            self.fp = fp
            self.dbqueue = Queue.Queue(50)
            self.event = event
            self.event.clear()
            super(FWriter,self).__init__()
    
        def myhandler(self,signum,frame): 
            #self.session.commit()
            self.session.close()
            self.fp.truncate()
            self.fp.write(str(self.offset))
            self.fp.close()
            print "Stopping Writer"
            sys.exit(0)
    
        def process_line(self,line): 
            #Do not process comments
            if line[0] == '#': 
                return None
            my_list = []
            split_line = line.split(',')
            my_list = split_line
            return my_list
    
        def run(self): 
            signal.signal(signal.SIGINT,self.myhandler)
            signal.signal(signal.SIGCLD,signal.SIG_DFL)
            signal.signal(signal.SIGILL,self.myhandler)
            while True: 
                sleep_now = False
                if not self.queue.empty(): 
                    print "WRITER:Getting"
                    line,offset = self.queue.get(False)
                    #Process the line just read
                    proc_line = self.process_line(line)
                    if proc_line: 
                        #Must write it to DB. Put it into DB Queue
                        if self.dbqueue.full(): 
                            #DB Queue is full, put data into DB before putting more data
                            self.empty_dbqueue()
                        self.dbqueue.put(proc_line)
                        #Keep a track of the maximum offset in the queue
                        self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
                else: 
                    #Looks like writing queue is empty. Just check if DB Queue is empty too
                    print "WRITER: Empty Read Queue"
                    self.empty_dbqueue()
                    sleep_now = True
                if sleep_now: 
                    self.event.clear()
                    print "WRITER: Sleeping for %d seconds"%self.sleep_interval
                    #time.sleep(self.sleep_interval)
                    self.event.wait(5) 
    
    
    
        def empty_dbqueue(self): 
            #The DB Queue has many objects waiting to be written to the DB. Lets write them 
            print "WRITER:Emptying DB QUEUE"
            while True: 
                try: 
                    new_line = self.dbqueue.get(False)
                except Queue.Empty: 
                    #Write the new offset to file
                    self.offset = self.queue_offset
                    break
                print new_line[0]
    
    def main(): 
        write_file = '/home/xyz/stats.offset'
        wp = open(write_file,'r')
        read_offset = wp.read()
        try: 
            read_offset = int(read_offset)
        except ValueError: 
            read_offset = 0
        wp.close()
        print read_offset
        read_file = '/var/log/somefile'
        file_q = multiprocessing.Queue(100)
        ev = multiprocessing.Event()
        new_reader = FReader(file_q,open(read_file,'r'),30,read_offset,ev)
        new_writer = FWriter(file_q,open('/dev/null'),30,open(write_file,'w'),ev)
        new_reader.start()
        new_writer.start()
        try: 
            new_reader.join()
            new_writer.join()
        except KeyboardInterrupt: 
            print "Closing Master"
            new_reader.join()
            new_writer.join()
    
    if __name__=='__main__': 
        main()
    

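The dbqueue in the writer is used to batch database writes, and for each line I keep that line's offset. The maximum offset written to the DB is stored in the offset file on exit, so that I can pick up where I left off on the next run. The DB object (session) is just '/dev/null' for demonstration.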
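The offset bookkeeping described above can be sketched on its own. This is a minimal illustration, not the code from the question: `load_offset` and `save_offset` are hypothetical helper names, and the write-to-a-temp-file-then-rename step is an assumption added here (it relies on POSIX rename semantics) so that a crash in the middle of a write leaves the previous offset in place. It is written to run on both Python 2.6 and Python 3.

```python
import os


def load_offset(path):
    """Return the saved byte offset, or 0 if the file is missing or not a number."""
    try:
        with open(path, 'r') as fp:
            return int(fp.read().strip())
    except (IOError, ValueError):
        # IOError: file missing or unreadable; ValueError: empty or garbage content
        return 0


def save_offset(path, offset):
    """Write the offset to a temporary file, then rename it over the old one."""
    tmp = path + '.tmp'  # hypothetical temp-file naming, chosen for this sketch
    with open(tmp, 'w') as fp:
        fp.write(str(offset))
    # On POSIX, rename replaces the destination in a single step.
    os.rename(tmp, path)
```

Note that opening the offset file with mode 'w' when the writer is constructed, as main() does, truncates it right away. Saving only when there is a new value to record avoids losing the old offset if the process dies before its handler runs.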

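Previously, instead of

    self.event.wait(5)

I was doing

    time.sleep(self.sleep_interval)

which (as I said) worked well but introduced a bit of delay. With that version, though, the processes exited cleanly.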

Now, on pressing Ctrl-C in the main process, the reader exits but the writer throws an OSError:

    ^CStopping Reader
    Closing Master
    Stopping Writer
    Process FWriter-2:
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
        self.run()
      File "FileParse.py", line 113, in run
        self.event.wait(5)
      File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 303, in wait
        self._cond.wait(timeout)
      File "/usr/lib64/python2.6/multiprocessing/synchronize.py", line 212, in wait
        self._wait_semaphore.acquire(True, timeout)
    OSError: [Errno 0] Error
    

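I know that event.wait() is somehow blocking the code, but I can't figure out how to get around this. I tried wrapping self.event.wait(5) and sys.exit() in a try: except OSError: block, but that only makes the program hang forever.

I am using Python 2.6.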
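One pattern that may be worth trying here, sketched under assumptions: keep the signal handler from raising (no sys.exit() inside it) and have it only record that a stop was requested. The loop then notices the flag and the cleanup runs in normal control flow. `StopFlag` and `run_until_stopped` are hypothetical names, and whether this avoids the OSError on Python 2.6 specifically has not been verified here.

```python
import signal


class StopFlag(object):
    """Records that a signal arrived; the handler itself never raises."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only set a flag; the main loop decides when to exit.
        self.requested = True


def run_until_stopped(stop, work, idle_wait):
    """Call work() repeatedly; when it reports nothing to do, call idle_wait().

    Returns the number of loop iterations once stop.requested is set.
    """
    iterations = 0
    while not stop.requested:
        if not work():
            idle_wait()
        iterations += 1
    return iterations
```

In FWriter.run() this would mean registering stop.handler with signal.signal(signal.SIGINT, ...), passing something like lambda: self.event.wait(5) as idle_wait, and doing the session.close() and the offset write after the loop returns.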

1 answer:

Answer 0 (score: 1)

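I think it would be better to use the Queue's blocking timeout in the Writer class: call Queue.get(True, 5), and if something is put into the queue during that interval, the writer wakes up immediately. The writer loop would then look something like this: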

    while True:
        try:
            print "WRITER:Getting"
            # Block for up to 5 seconds; returns as soon as an item arrives
            line,offset = self.queue.get(True, 5)
            #Process the line just read
            proc_line = self.process_line(line)
            if proc_line:
                #Must write it to DB. Put it into DB Queue
                if self.dbqueue.full():
                    #DB Queue is full, put data into DB before putting more data
                    self.empty_dbqueue()
                self.dbqueue.put(proc_line)
                #Keep a track of the maximum offset in the queue
                self.queue_offset = offset if offset > self.queue_offset else self.queue_offset
        except Queue.Empty:
            #Looks like writing queue is empty. Just check if DB Queue is empty too
            print "WRITER: Empty Read Queue"
            self.empty_dbqueue()
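
For reference, here is a self-contained sketch of the same idea that can be run outside the class. It is an illustration under assumptions rather than a drop-in replacement: `writer_loop`, the `STOP` sentinel and the `flush` callback are hypothetical names, a plain list stands in for dbqueue, and unlike process_line it strips the trailing newline before splitting. multiprocessing.Queue.get accepts the same (block, timeout) arguments and raises the same Empty exception as the standard queue module, so the demo uses an in-process queue to stay simple.

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, as in the question

STOP = None  # hypothetical sentinel: the reader puts this to ask the writer to finish


def writer_loop(in_q, flush, timeout=5, batch_size=50):
    """Consume (line, offset) items from in_q until STOP arrives.

    in_q  -- any queue whose get(block, timeout) raises queue.Empty
    flush -- callable taking a list of parsed rows (stand-in for the DB write)
    Returns the largest offset seen, i.e. the value to checkpoint.
    """
    batch = []
    max_offset = 0
    while True:
        try:
            # Blocks until an item arrives or the timeout expires.
            item = in_q.get(True, timeout)
        except queue.Empty:
            # Idle period: write out whatever has been batched so far.
            if batch:
                flush(batch)
                batch = []
            continue
        if item is STOP:
            break
        line, offset = item
        max_offset = max(max_offset, offset)
        if line.startswith('#'):   # skip comment lines
            continue
        batch.append(line.rstrip('\n').split(','))
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)               # final flush before returning
    return max_offset


if __name__ == '__main__':
    q = queue.Queue()
    for entry in [("a,b\n", 4), ("# comment\n", 14), ("c,d\n", 18), STOP]:
        q.put(entry)
    rows = []
    print(writer_loop(q, rows.extend, timeout=1))
    print(rows)
```

Because the loop sleeps inside get() rather than in a separate wait, there is no window in which the reader has filled the queue while the writer is still sleeping out its interval, and no Event is needed. The sentinel is one possible way to end the loop; the question's signal-based shutdown is a separate concern.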