Question

Here is the goal: a parser that gathers certain information from certain domains and organizes it in one place.

I am new to Python and chose this language for the job because of its learning curve, among other things.

For this I am using the BeautifulSoup library for parsing, and it works like a charm. The routine is triggered by crontab on CentOS 6 with Python 2.7.

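However, one of my parsing scripts sent me a log with a memory error, which made the .py file exit without finishing its job. Googling here and there, I found that parsing some very long HTML pages in Python can make my server run out of memory, and that it is better to close, decompose, or even garbage collect everything the script is no longer going to use.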
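A minimal sketch of that advice, written so it runs on Python 2.7 and Python 3. The names `process_pages`, `open_url`, `make_soup`, `handle` and `collect_every` are hypothetical and chosen for illustration; in this script `open_url` would be `urllib2.urlopen` and `make_soup` would be `BeautifulSoup`. It only shows one way to order the three cleanup steps and is not a confirmed fix.

```python
import gc
from contextlib import closing


def process_pages(urls, open_url, make_soup, handle, collect_every=10):
    """Fetch, parse and handle each URL, releasing resources as it goes.

    open_url, make_soup and handle are hypothetical callables supplied by
    the caller, e.g. urllib2.urlopen, BeautifulSoup and a function that
    stores the extracted data.
    """
    processed = 0
    for url in urls:
        # closing() calls .close() on the response even if read() raises.
        with closing(open_url(url)) as response:
            html = response.read()
        soup = make_soup(html)
        try:
            handle(url, soup)
        finally:
            # Destroy the tree only after the last use of anything inside it.
            soup.decompose()
        processed += 1
        # Run the collector once every collect_every pages.
        if processed % collect_every == 0:
            gc.collect()
    return processed
```

The point of the ordering is that `decompose()` is called only once nothing taken from the tree is needed any more, since tags found inside a soup belong to that same tree.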

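I implemented all three, and the crontab job no longer raises memory errors. However, every time the script runs I receive an email from crontab containing the parsing log, which means something went wrong. Checking the database, all the information was recorded correctly and the script completed the whole task. Still, some error must have occurred, or crontab would not be emailing me the log.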
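One thing worth checking here, offered as an assumption about this setup and not a diagnosis: cron mails whatever a job writes to stdout or stderr, so the `print` calls alone are enough to trigger an email even when nothing failed. A minimal sketch that sends messages to a log file instead; `make_file_logger`, the logger name and the log path are hypothetical names chosen for illustration.

```python
import logging


def make_file_logger(path, name="grab_categories"):
    """Return a logger that writes to a file and not to stdout/stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keep records away from any handlers attached to the root logger.
    logger.propagate = False
    # Attach the file handler only once, even if this is called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Usage would look like `log = make_file_logger('/tmp/grab_categories.log')` followed by `log.warning('Could not open %s', target)` in place of the `print` statement. If the emails stop once the job is silent, the messages were informational output and not a crash.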

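In fact, the same thing happens when I run the script directly in a terminal on the server: the script never ends unless I press Ctrl+C, and it just freezes on the screen. Yet again, looking at the database, every task completed without errors.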

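I have tried using only gc, only close() and only decompose(). Any one of these three will freeze the screen or generate the log error (with no explicit error message, though).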

Here is a simplified version of what I am doing, for better understanding:

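import gc
import urllib2

from bs4 import BeautifulSoup  # assumption: bs4; for BeautifulSoup 3 use "from BeautifulSoup import BeautifulSoup"


class GrabCategories():
    def __init__(self):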

        target = 'http://provider-site.com/info.html'
        try:
            page = urllib2.urlopen(target)
            if page.getcode() == 404:
                print 'Page not found', target
                return  # __init__ must return None; returning False raises TypeError
            soup = BeautifulSoup(page.read())
            page.close() #not using this anymore, may I close it?
        except:
            print 'Could not open', target
            return

        content = soup.find('div', {'id': 'box-content'})
        soup.decompose() #not using this anymore, may I decompose it?

        c=0
        for link in content.findAll('a'):

            #define some vars

            try:
                catPage = urllib2.urlopen(link['href'])  # the URL is in the href attribute
                if catPage.getcode() == 404:
                    print 'Page not found', link['href']
                    return  # __init__ must return None
                catSoup = BeautifulSoup(catPage.read())
                catPage.close() #not using this anymore, may I close it?
            except:
                print 'Could not open', link.get('href')
                continue

            #do some things with the page content etc 

            catSoup.decompose() #not using this anymore, may I decompose it?

            if c % 10 == 0:  # run the collector once every 10 pages
                gc.collect()
            c = c + 1

Python garbage collection causes crontab log error

0 answers: