Webkit_server (called from python's dryscrape) uses more and more memory as each page is visited. How do I reduce the memory used?

Date: 2015-08-25 18:45:49

Tags: python-3.x memory web-scraping webkit qwebkit

I'm writing a scraper in python3 using dryscrape. During one scraping session I am trying to visit hundreds of different urls and click through about 10 ajax pages on each url (without visiting a different url per ajax page). I need something like dryscrape because I need to be able to interact with javascript components. The classes I wrote for my needs work, but I run out of memory after I have visited about 50 or 100 pages (all 4 GB of memory are used and the 4 GB of swap space is virtually 100% full). I looked at what was using up the memory, and the webkit_server process appears to be responsible for all of it. Why is this happening, and how can I avoid it?

Below are the relevant excerpts of my class and main method.

Here is the class that uses webkit_server (via dryscrape), so you can see exactly which settings I am using:

import dryscrape
from lxml import html
from time import sleep
from webkit_server import InvalidResponseError
import re

from utils import unugly, my_strip, cleanhtml, stringify_children
from Profile import Profile, Question

class ExampleSession():

    def __init__(self, settings):
        self.settings = settings
        # dryscrape.start_xvfb()
        self.br = self.getBrowser()

    def getBrowser(self):
        session = dryscrape.Session()
        session.set_attribute('auto_load_images', False)
        session.set_header('User-agent', 'Google Chrome')
        return session

    def login(self):
        try:
            print('Trying to log in... ')
            self.br.visit('https://www.example.com/login')                        
            self.br.at_xpath('//*[@id="login_username"]').set(self.settings['myUsername'])
            self.br.at_xpath('//*[@id="login_password"]').set(self.settings['myPassword'])
            q = self.br.at_xpath('//*[@id="loginbox_form"]')
            q.submit()
        except Exception as e:
            print(str(e))
            print('\tException and couldn\'t log in!')
            return
        print('Logged in as %s' % (str(self.settings['myUsername']))) 

    def getProfileQuestionsByUrl(self, url, thread_id=0):
        self.br.visit(str(url.rstrip()) + '/questions')

        tree = html.fromstring(self.br.body())
        questions = []

        num_pages = int(my_strip(tree.xpath('//*[@id="questions_pages"]//*[@class="last"]')[0].text))

        page = 0
        while (page < num_pages):
            sleep(0.5)
            # Do something with each ajax page
            # Next try-except tries to click the 'next' button
            try:
                next_button = self.br.at_xpath('//*[@id="questions_pages"]//*[@class="next"]')
                next_button.click()
            except Exception as e:
                pass                
            page = page + 1

        return questions

    def getProfileByUrl(self, url, thread_id=0):
        missing = 'NA'

        try:
            try:
                # Visit a unique url
                self.br.visit(url.rstrip())
            except Exception as e:
                print(str(e))
                return None
            tree = html.fromstring(self.br.body())

            map = {}
            # Fill up the dictionary with some things I find on the page

            profile = Profile(map)    
            return profile
        except Exception as e:
            print(str(e))
            return None

And here is the main method (excerpt):

from socket import error as SocketError  # assumed import for the SocketError handler below

def getProfiles(settings, urls, thread_id):
    exampleSess = ExampleSession(settings)
    exampleSess.login()
    profiles = []
    '''
    I want to visit at most a thousand unique urls (but I don't care if it
    takes 2 hours or 2 days, as long as the session doesn't fatally break
    and my laptop doesn't run out of memory)
    '''
    for url in urls:
        try:
            profile = exampleSess.getProfileByUrl(url, thread_id)
            if (profile is not None):
                profiles.append(profile)
                try:
                    if (settings['scrapeQuestions'] == 'yes'):
                        profile_questions = exampleSess.getProfileQuestionsByUrl(url, thread_id)
                        if (profile_questions is not None):
                            profile.add_questions(profile_questions)
                except SocketError as e:
                    print(str(e))
                    print('\t[Thread %d] SocketError in getProfileQuestionsByUrl of profile...' % (thread_id))
        except Exception as e:
            print(str(e))
            print('\t[Thread %d] Exception while getting profile %s' % (thread_id, str(url.rstrip())))
            exampleSess.br.reset()
    exampleSess = None  # Does this kill my dryscrape session and prevent webkit_server from running?
    return profiles

Did I set up dryscrape correctly? Why does dryscrape's webkit_server accumulate more and more memory the more urls I visit via getProfileByUrl and getProfileQuestionsByUrl? Am I missing any setting that might add to the memory usage?
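As an aside on the comment in the code above: rebinding exampleSess to None only drops the Python reference; it does not by itself terminate the webkit_server child process. A commonly suggested workaround, sketched below under assumptions (the explicit-server wiring uses dryscrape's and webkit_server's public classes; urls and the recycle interval of 50 are illustrative, not from the question), is to start webkit_server yourself so you can kill and recreate it periodically, capping its memory growth:

import dryscrape
import webkit_server

def fresh_session():
    # Start the webkit_server process explicitly so we keep a handle to it
    # and can terminate it later, instead of letting dryscrape spawn it.
    server = webkit_server.Server()
    conn = webkit_server.ServerConnection(server=server)
    driver = dryscrape.driver.webkit.Driver(connection=conn)
    session = dryscrape.Session(driver=driver)
    session.set_attribute('auto_load_images', False)
    return server, session

server, session = fresh_session()
for i, url in enumerate(urls):  # urls: the same list passed to getProfiles
    session.visit(url.rstrip())
    # ... scrape the page as before ...
    if i % 50 == 49:
        server.kill()  # terminates the webkit_server child and frees its memory
        server, session = fresh_session()
server.kill()

The trade-off is that all session state (cookies, login) is lost on every restart, so the login step has to be repeated after each kill().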

1 Answer:

Answer 0 (score: 2):

I was never able to solve the memory problem (and I could reproduce it on a separate laptop). I ended up switching from dryscrape to selenium (and then to PhantomJS). PhantomJS has been superior in my opinion, and it does not use a lot of memory either.
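For comparison, here is a minimal sketch of the same login-and-paginate flow driven through selenium's PhantomJS driver (since deprecated in later selenium releases); it assumes the phantomjs binary is on PATH, and the url, credentials, and element ids are carried over from the question as placeholders, not taken from the answer:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.PhantomJS()  # headless WebKit; assumes phantomjs is on PATH

driver.get('https://www.example.com/login')
driver.find_element_by_id('login_username').send_keys('myUsername')
driver.find_element_by_id('login_password').send_keys('myPassword')
driver.find_element_by_id('loginbox_form').submit()

# Click through the ajax pagination the same way the dryscrape version did.
try:
    driver.find_element_by_xpath('//*[@id="questions_pages"]//*[@class="next"]').click()
except NoSuchElementException:
    pass

driver.quit()  # shuts down the phantomjs process and frees its memory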