Basically, I pull a series of links out of a database and want to scrape them for the specific links I'm looking for. I then feed those links back into a central link queue that my multiple QWebViews reference, and they keep pulling links off it for processing/storage.

My problem is that as it runs through 200 or 500 links, it starts eating more and more RAM.

I have exhaustively investigated it with heapy, memory_profiler, and objgraph to figure out what's causing the memory leak... The Python heap's objects stay constant in both number and size over time. This makes me think the C++ objects aren't getting deleted. Sure enough, using memory_profiler, the RAM only goes up when the self.load(self.url) line of code is called. I've tried to fix this, but to no avail.

Code:
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebView, QWebSettings
from PyQt4.QtGui import QApplication
from lxml.etree import HTMLParser

# My functions
from util import dump_list2queue, parse_doc

class ThreadFlag:
    def __init__(self, threads, jid, db):
        self.threads = threads
        self.job_id = jid
        self.db_direct = db
        self.xml_parser = HTMLParser()

class WebView(QWebView):
    def __init__(self, thread_flag, id_no):
        super(QWebView, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)
        self.settings().globalSettings().setAttribute(QWebSettings.AutoLoadImages, False)

        # This is actually a dict with a few additional details about the url we want to pull
        self.url = None

        # doing one instance of this to avoid memory leaks
        self.qurl = QUrl()

        # id of the webview instance
        self.id = id_no

        # Status of the webview instance: green means it isn't working, yellow means it is.
        self.status = 'GREEN'

        # Reference to a single universal object all the webview instances can see.
        self.thread_flag = thread_flag

    def handleLoadFinished(self):
        try:
            self.processCurrentPage()
        except Exception as e:
            print e

        self.status = 'GREEN'
        if not self.fetchNext():
            # We're finished!
            self.loadFinished.disconnect()
            self.stop()
        else:
            # We're not finished! Do next url.
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

    def processCurrentPage(self):
        self.frame = str(self.page().mainFrame().toHtml().toUtf8())
        # This is the case for the initial web pages I want to gather links from.
        if 'name' in self.url:
            # Parse html string for links I'm looking for.
            new_links = parse_doc(self.thread_flag.xml_parser, self.url, self.frame)
            if len(new_links) == 0: return 0
            fkid = self.url['pkid']
            new_links = map(lambda x: (fkid, x['title'], x['url'], self.thread_flag.job_id), new_links)
            # Post links to database, db de-dupes and then repull ones that made it.
            self.thread_flag.db_direct.post_links(new_links)
            added_links = self.thread_flag.db_direct.get_links(self.thread_flag.job_id, fkid)
            # Add the pulled links to the central queue all the qwebviews pull from.
            dump_list2queue(added_links, self._urls)
            del added_links
        else:
            # Process one of the links pulled from the initial set of data that was originally in the queue.
            print "Processing target link!"

    # Get next url from the universal queue!
    def fetchNext(self):
        if self._urls and self._urls.empty():
            self.status = 'GREEN'
            return False
        else:
            self.status = 'YELLOW'
            self.url = self._urls.get()
            return True

    def start(self, urls):
        # This is where the reference to the universal queue gets made.
        self._urls = urls
        if self.fetchNext():
            self.qurl.setUrl(self.url['url'])
            self.load(self.qurl)

# uq = central url queue shared between the webview instances
# ta = array of webview objects
# tf = thread flag (basically just a custom universal object that all the webviews can access).
# This main "program" is started by another script elsewhere.
def main_program(uq, ta, tf):
    app = QApplication([])
    webviews = ta
    threadflag = tf
    tf.app = app
    print "Beginning the multiple async web calls..."
    # Create n "threads" (really just webviews) that each make asynchronous calls.
    for n in range(0, threadflag.threads):
        webviews.append(WebView(threadflag, n + 1))
        webviews[n].start(uq)
    app.exec_()
Here's what my memory tools report (the figures stay constant throughout the program's run):
2491 (MB)

method_descriptor            9959
function                     8342
weakref                      6440
tuple                        6418
dict                         4982
wrapper_descriptor           4380
getset_descriptor            2314
list                         1890
method_descriptor            1445
builtin_function_or_method   1298

Partition of a set of 9879 objects. Total size = 1510000 bytes.
 Index  Count   %     Size    % Cumulative  % Kind (class / dict of class)
     0   2646  27   445216   29    445216  29 str
     1    563   6   262088   17    707304  47 dict (no owner)
     2   2267  23   199496   13    906800  60 __builtin__.weakref
     3   2381  24   179128   12   1085928  72 tuple
     4    212   2   107744    7   1193672  79 dict of guppy.etc.Glue.Interface
     5     50   1    52400    3   1246072  83 dict of guppy.etc.Glue.Share
     6    121   1    40200    3   1286272  85 list
     7    116   1    32480    2   1318752  87 dict of guppy.etc.Glue.Owner
     8    240   2    30720    2   1349472  89 types.CodeType
     9     42   0    24816    2   1374288  91 dict of class
Answer (score: 0):
Your program really is growing because of the C++ side, but it's not a leak in the sense of creating objects that are no longer referenced. At least part of what's happening is that your QWebView holds a QWebPage, which holds a QWebHistory. Every time you call self.load, the history gets a little longer.

Note that QWebHistory has a clear() function.

Relevant documentation: http://pyqt.sourceforge.net/Docs/PyQt4/qwebview.html#history
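To make the accumulation concrete, here is a minimal plain-Python sketch of the mechanism; FakeHistory and FakeView are made-up stand-ins for illustration, not Qt classes. One view never clears its history and retains an entry per load, while the other clears it after every load, which is the equivalent of calling self.page().history().clear() in the real PyQt4 code.

```python
class FakeHistory(object):
    """Stand-in for QWebHistory: one entry is appended per navigation."""
    def __init__(self):
        self.items = []

    def clear(self):
        # QWebHistory.clear() likewise drops all stored history items.
        del self.items[:]


class FakeView(object):
    """Stand-in for a QWebView; optionally clears history after each load."""
    def __init__(self, clear_after_load=False):
        self.history = FakeHistory()
        self.clear_after_load = clear_after_load

    def load(self, url):
        # Each navigation grows the history by one entry.
        self.history.items.append(url)
        if self.clear_after_load:
            self.history.clear()  # the suggested fix


leaky = FakeView()
fixed = FakeView(clear_after_load=True)
for n in range(500):
    url = "http://example.com/%d" % n
    leaky.load(url)
    fixed.load(url)

print(len(leaky.history.items))  # 500 entries retained, one per load
print(len(fixed.history.items))  # 0
```

In the real code, the natural place for the clear is at the end of handleLoadFinished, after processCurrentPage has run, so each page's history is dropped before the next load begins.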