PyQt4 - QWebView.load(url) leaks memory (not from Python)

Time: 2018-07-08 05:38:10

Tags: python qt memory-leaks pyqt qwebview

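Basically, I pull a list of links from a database and scrape each page for the specific links I'm looking for. I then feed those links back into a link queue shared by my multiple QWebViews, which keep pulling them down for processing/storage.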

My problem is that once it has run through 200 or 500 links, it starts consuming more and more RAM.

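I have looked into this exhaustively with heapy, memory_profiler and objgraph to find out what is causing the memory leak... Over time, the objects on the Python heap stay the same in number and size. That makes me think C++ objects are not being deleted. Sure enough, according to memory_profiler, RAM only goes up when the self.load(self.url) line of code is called. I have tried to fix this, but to no avail.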

Code:

from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView, QWebSettings
from PyQt4.QtGui import QApplication
from lxml.etree import HTMLParser

# My functions
from util import dump_list2queue, parse_doc

class ThreadFlag:
    def __init__(self, threads, jid, db):
        self.threads = threads
        self.job_id = jid
        self.db_direct = db
        self.xml_parser = HTMLParser()

class WebView(QWebView):
    def __init__(self, thread_flag, id_no):
        super(WebView, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)
        self.settings().globalSettings().setAttribute(QWebSettings.AutoLoadImages, False)
        # This is actually a dict with a few additional details about the url we want to pull
        self.url = None
        # doing one instance of this to avoid memory leaks
        self.qurl = QUrl()
        # id of the webview instance
        self.id = id_no
        # Status of this webview instance: GREEN means it is idle, YELLOW means it is working.
        self.status = 'GREEN'
        # Reference to a single universal object all the webview instances can see.
        self.thread_flag = thread_flag

    def handleLoadFinished(self):
        try:
            self.processCurrentPage()
        except Exception as e:
            print e

        self.status = 'GREEN'

        if not self.fetchNext():
            # We're finished!
            self.loadFinished.disconnect()
            self.stop()
        else:
            # We're not finished! Do next url.
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

    def processCurrentPage(self):
        self.frame = str(self.page().mainFrame().toHtml().toUtf8())

        # This is the case for the initial web pages I want to gather links from.
        if 'name' in self.url:
            # Parse html string for links I'm looking for.
            new_links = parse_doc(self.thread_flag.xml_parser, self.url, self.frame)
            if len(new_links) == 0: return 0
            fkid = self.url['pkid']
            new_links = map(lambda x: (fkid, x['title'],x['url'], self.thread_flag.job_id), new_links)


            # Post links to database, db de-dupes and then repull ones that made it.
            self.thread_flag.db_direct.post_links(new_links)
            added_links = self.thread_flag.db_direct.get_links(self.thread_flag.job_id,fkid)

            # Add the pulled links to central queue all the qwebviews pull from
            dump_list2queue(added_links, self._urls)
            del added_links
        else:
            # Process one of the links I pulled from the initial set of data that was originally in the queue.
            print "Processing target link!"

    # Get next url from the universal queue!
    def fetchNext(self):
        if self._urls and self._urls.empty():
            self.status = 'GREEN'
            return False
        else:
            self.status = 'YELLOW'
            self.url = self._urls.get()
            return True

    def start(self, urls):
        # This is where the reference to the universal queue gets made.
        self._urls = urls
        if self.fetchNext():
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

# uq = central url queue shared between webview instances
# ta = array of webview objects
# tf - thread flag (basically just a custom universal object that all the webviews can access).

# This main "program" is started by another script elsewhere.
def main_program(uq, ta, tf):

    app = QApplication([])
    webviews = ta
    threadflag = tf

    tf.app = app

    print "Beginning the multiple async web calls..."

    # Create n "threads" (really just webviews) that each will make asynchronous calls.
    for n in range(0,threadflag.threads):
        webviews.append(WebView(threadflag, n+1))
        webviews[n].start(uq)

    app.exec_()

This is what my memory tools report (they stay constant throughout the program):

  1. RAM: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

     2491 (MB)

  2. Most common object types:

     methoddescriptor            9959
     function                    8342
     weakref                     6440
     tuple                       6418
     dict                        4982
     wrapper_descriptor          4380
     getset_descriptor           2314
     list                        1890
     method_descriptor           1445
     builtin_function_or_method  1298

  3. Heap:

     Partition of a set of 9879 objects. Total size = 1510000 bytes.

     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0   2646  27   445216  29    445216  29 str
         1    563   6   262088  17    707304  47 dict (no owner)
         2   2267  23   199496  13    906800  60 __builtin__.weakref
         3   2381  24   179128  12   1085928  72 tuple
         4    212   2   107744   7   1193672  79 dict of guppy.etc.Glue.Interface
         5     50   1    52400   3   1246072  83 dict of guppy.etc.Glue.Share
         6    121   1    40200   3   1286272  85 list
         7    116   1    32480   2   1318752  87 dict of guppy.etc.Glue.Owner
         8    240   2    30720   2   1349472  89 types.CodeType
         9     42   0    24816   2   1374288  91 dict of class
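
A minimal sketch of how two of these readings could be taken side by side during a run. The helper name memory_snapshot is hypothetical, and the conversion assumes Linux, where ru_maxrss is reported in kilobytes. If the object count stays flat while peak RSS keeps climbing, the growth is happening outside the Python heap.

```python
import gc
import resource


def memory_snapshot():
    """Return (tracked_object_count, peak_rss_mb) for the current process.

    Hypothetical helper. The object count only covers objects tracked by
    Python's garbage collector; peak_rss_mb assumes ru_maxrss is in KB
    (true on Linux, not on macOS).
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return len(gc.get_objects()), peak_kb / 1024.0


if __name__ == "__main__":
    count, peak_mb = memory_snapshot()
    print("tracked objects: %d, peak RSS: %.1f MB" % (count, peak_mb))
```

Calling it from handleLoadFinished every few dozen pages and printing the pair would give a rough timeline of both numbers.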

1 answer:

Answer 0 (score: 0)

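Your program is indeed growing because of C++ code, but it is not a true leak in the sense of objects being created that are no longer referenced. What is happening, at least in part, is that your QWebView holds a QWebPage, which in turn holds a QWebHistory. Every time you call self.load, the history gets a little longer.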

Note that QWebHistory has a clear() function.
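
A minimal sketch of that suggestion, not tested against the code in the question. The helper name load_with_clean_history is hypothetical; history(), clear() and load() are the QWebView and QWebHistory methods mentioned above. Whether this removes all of the growth is an assumption you would need to confirm by measuring.

```python
def load_with_clean_history(view, qurl):
    """Clear the view's navigation history, then start loading qurl.

    Hypothetical helper. `view` is expected to behave like a PyQt4
    QWebView: history() returns its QWebHistory, and clear() drops the
    stored entries so they do not pile up across successive loads.
    """
    view.history().clear()
    view.load(qurl)
```

In the WebView class from the question, each self.load(self.qurl) call would become load_with_clean_history(self, self.qurl).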

Documentation: http://pyqt.sourceforge.net/Docs/PyQt4/qwebview.html#history