Question

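Basically, I pull a series of links from a database and want to scrape them for the specific links I'm looking for. I then feed those links back into a link queue that my multiple QWebViews reference, and they continue to pull those links down for processing/storage.

My issue is that once this has run through 200 or 500 links, it starts to use up more and more RAM.

I have looked into this exhaustively using heapy, memory_profiler and objgraph to figure out what is causing the memory leak... The Python heap's objects stay about the same in both number and size over time. This made me think that the C++ objects weren't being deleted. Sure enough, according to memory_profiler, the RAM only goes up when the self.load(self.url) lines of code are called. I've tried to fix this, but to no avail.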

Code:

from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView, QWebSettings
from PyQt4.QtGui import QApplication
from lxml.etree import HTMLParser

# My functions
from util import dump_list2queue, parse_doc

class ThreadFlag:
    def __init__(self, threads, jid, db):
        self.threads = threads
        self.job_id = jid
        self.db_direct = db
        self.xml_parser = HTMLParser()

class WebView(QWebView):
    def __init__(self, thread_flag, id_no):
        super(WebView, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)
        self.settings().globalSettings().setAttribute(QWebSettings.AutoLoadImages, False)
        # This is actually a dict with a few additional details about the url we want to pull
        self.url = None
        # doing one instance of this to avoid memory leaks
        self.qurl = QUrl()
        # id of the webview instance
        self.id = id_no
        # Status of this webview instance: GREEN means it is idle, YELLOW means it is working.
        self.status = 'GREEN'
        # Reference to a single universal object all the webview instances can see.
        self.thread_flag = thread_flag

    def handleLoadFinished(self):
        try:
            self.processCurrentPage()
        except Exception as e:
            print e

        self.status = 'GREEN'

        if not self.fetchNext():
            # We're finished!
            self.loadFinished.disconnect()
            self.stop()
        else:
            # We're not finished! Do next url.
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

    def processCurrentPage(self):
        self.frame = str(self.page().mainFrame().toHtml().toUtf8())

        # This is the case for the initial web pages I want to gather links from.
        if 'name' in self.url:
            # Parse html string for links I'm looking for.
            new_links = parse_doc(self.thread_flag.xml_parser, self.url, self.frame)
            if len(new_links) == 0: return 0
            fkid = self.url['pkid']
            new_links = map(lambda x: (fkid, x['title'],x['url'], self.thread_flag.job_id), new_links)


            # Post links to database, db de-dupes and then repull ones that made it.
            self.thread_flag.db_direct.post_links(new_links)
            added_links = self.thread_flag.db_direct.get_links(self.thread_flag.job_id,fkid)

            # Add the pulled links to central queue all the qwebviews pull from
            dump_list2queue(added_links, self._urls)
            del added_links
        else:
            # Process one of the links I pulled from the initial set of data that was originally in the queue.
            print "Processing target link!"

    # Get next url from the universal queue!
    def fetchNext(self):
        if self._urls and self._urls.empty():
            self.status = 'GREEN'
            return False
        else:
            self.status = 'YELLOW'
            self.url = self._urls.get()
            return True

    def start(self, urls):
        # This is where the reference to the universal queue gets made.
        self._urls = urls
        if self.fetchNext():
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

# uq = central url queue shared between webview instances
# ta = array of webview objects
# tf - thread flag (basically just a custom universal object that all the webviews can access).

# This main "program" is started by another script elsewhere.
def main_program(uq, ta, tf):

    app = QApplication([])
    webviews = ta
    threadflag = tf

    tf.app = app

    print "Beginning the multiple async web calls..."

    # Create n "threads" (really just webviews) that each will make asynchronous calls.
    for n in range(0,threadflag.threads):
        webviews.append(WebView(threadflag, n+1))
        webviews[n].start(uq)

    app.exec_()

This is what my memory tools report (the values stay constant throughout the whole program):

RAM: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

2491 (MB)
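To watch the growth between page loads, the measurement above can be wrapped in a small helper. This is only a sketch: the name peak_rss_mb is hypothetical, and it assumes Linux, where ru_maxrss is reported in kilobytes (macOS reports bytes instead). Note that ru_maxrss is a high-water mark, so it never decreases even if memory is released later.

```python
import resource


def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    Assumes Linux, where ru_maxrss is given in kilobytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```

Printing this value every N loads (for example from handleLoadFinished) should show whether the peak keeps climbing as more URLs are processed.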

Most common types of objects:

methoddescriptor 9959
function 8342
weakref 6440
tuple 6418
dict 4982
wrapper_descriptor 4380
getset_descriptor 2314
list 1890
method_descriptor 1445
builtin_function_or_method 1298

Heap:

Partition of a set of 9879 objects. Total size = 1510000 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0   2646  27   445216  29    445216  29 str
     1    563   6   262088  17    707304  47 dict (no owner)
     2   2267  23   199496  13    906800  60 __builtin__.weakref
     3   2381  24   179128  12   1085928  72 tuple
     4    212   2   107744   7   1193672  79 dict of guppy.etc.Glue.Interface
     5     50   1    52400   3   1246072  83 dict of guppy.etc.Glue.Share
     6    121   1    40200   3   1286272  85 list
     7    116   1    32480   2   1318752  87 dict of guppy.etc.Glue.Owner
     8    240   2    30720   2   1349472  89 types.CodeType
     9     42   0    24816   2   1374288  91 dict of class

Answer 1

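Your program is indeed growing because of C++ code, but it is not a true leak in the sense of creating objects that are no longer referenced. What is happening, at least in part, is that your QWebView holds a QWebPage, which holds a QWebHistory(). Each time you call self.load, the history gets a bit longer.

Note that QWebHistory has a clear() function.

Documentation: http://pyqt.sourceforge.net/Docs/PyQt4/qwebview.html#history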
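As a minimal sketch of that suggestion, the history could be cleared just before each new load. The helper name load_with_clean_history is hypothetical. It only assumes that view behaves like a QWebView, meaning history() returns a QWebHistory-like object with count() and clear(), and load() starts the request. Whether this removes all of the growth is not established here, since the history is described as only part of the cause.

```python
def load_with_clean_history(view, qurl):
    """Clear the view's navigation history, then start loading qurl.

    Returns the number of history entries that had accumulated before
    the clear, which can be logged to confirm the history was growing.
    """
    history = view.history()
    accumulated = history.count()  # entries built up by earlier loads
    history.clear()                # drop them so the list stops growing
    view.load(qurl)                # asynchronous; loadFinished fires later
    return accumulated
```

In the question's WebView class, the two self.load(self.qurl) calls (in handleLoadFinished and in start) would become load_with_clean_history(self, self.qurl).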

PyQt 4 - QWebView.load(url) leaks memory (not from Python)