Error: Too many open files with multiprocessing in Python

Date: 2018-07-15 08:56:59

Tags: python multithreading multiprocessing

Input (a.txt) contains data as:
{person1: [www.person1links1.com]}

{person2: [www.person2links1.com,www.person2links2.com]}...(36000 lines of such data)

I am interested in extracting data from each person's personal links. My code is:

import urllib.request
import multiprocessing as mp

def get_bio(authr,urllist):
    author_data=[]
    for each in urllist:
        try:
            html = urllib.request.urlopen(each).read()
            author_data.append(html)
        except:
            continue
    f=open(authr+'.txt','w+')
    for each in author_data:
        f.write(str(each))
        f.write('\n')
        f.write('********************************************')
        f.write('\n')
    f.close()
if __name__ == '__main__':
    q=mp.Queue()
    processes=[]
    with open('a.txt') as f:
        for each in f:
            q.put(each)# dictionary
    while (q.qsize())!=0:
        for authr,urls in q.get().items():
            p=mp.Process(target=get_bio,args=(authr,urls))
            processes.append(p)
            p.start()
    for proc in processes:
        proc.join()

I get the following error when running this code (I tried setting ulimit, but hit the same error):

OSError: [Errno 24] Too many open files: 'personx.txt'
Traceback (most recent call last):
  File "perbio_mp.py", line 88, in <module>
    p.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 66, in _launch
    parent_r, child_w = os.pipe()
OSError: [Errno 24] Too many open files

Please point out where I am going wrong and how to fix it. Thanks.

2 Answers:

Answer 0 (score: 0)

Check your operating system's maximum number of file descriptors. Some versions of macOS have a fairly low limit of 256 open files, e.g. OS X El Capitan (10.11).

In any case, you can run the following command:

ulimit -n 4096

before running your Python code.

If your code still breaks, check how many times the method def get_bio(authr,urllist) is called. What may be happening is that the loop opens more files than the operating system can handle.
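If changing the shell limit is inconvenient, the soft limit can also be raised from inside the script with the standard resource module. This is not part of the original code, and the hard limit itself can only be raised with elevated privileges; a minimal sketch:

import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# Raise the soft limit, staying at or below the hard limit.
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))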

Answer 1 (score: 0)

urlopen returns a response object that wraps an open file. Your code never closes these files, hence the problem.

The response object is also a context manager, so instead of

    html = urllib.request.urlopen(each).read()
    author_data.append(html)

you can do

with urllib.request.urlopen(each) as response:
    author_data.append(response.read())

to make sure the file is closed after reading.
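Applied to the whole function, here is a sketch of get_bio with both the responses and the output file managed by with blocks (otherwise the same logic as in the question):

import urllib.request

def get_bio(authr, urllist):
    author_data = []
    for each in urllist:
        try:
            # The response (and its underlying file) is closed when the block exits.
            with urllib.request.urlopen(each) as response:
                author_data.append(response.read())
        except Exception:
            continue
    # The output file is closed even if a write raises.
    with open(authr + '.txt', 'w+') as f:
        for each in author_data:
            f.write(str(each))
            f.write('\n')
            f.write('********************************************')
            f.write('\n')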

Also, as observed in the comments, you should reduce the number of active processes to a reasonable amount, since each of them opens files at the OS level.
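One way to do that is a fixed-size multiprocessing.Pool instead of one Process per line. A minimal sketch, assuming a worker count of 8 and a hypothetical parse_line helper that turns a line of a.txt into an {author: [urls]} mapping (neither is in the original code):

import multiprocessing as mp

def worker(item):
    # item is one (author, url_list) pair built from a line of a.txt.
    authr, urls = item
    get_bio(authr, urls)

if __name__ == '__main__':
    items = []
    with open('a.txt') as f:
        for line in f:
            # parse_line is a placeholder for however a line is turned into a dict.
            for authr, urls in parse_line(line).items():
                items.append((authr, urls))

    # A pool of 8 workers bounds the number of live processes, and therefore
    # the number of pipes and files held open at any one time.
    with mp.Pool(processes=8) as pool:
        pool.map(worker, items)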