Question

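I'm using a simple CrawlSpider in Scrapy, with Python 2.7 and Scrapy version 1.1.2. It looks like the code below. I have a function that analyzes text (let's call it MyFunction) and returns true or false, and I want to follow links only from crawled pages whose response.body makes that function return true.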

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']  # start URLs need a scheme

    rules = (
        Rule(LinkExtractor(allow=('something/', ), deny=('something=', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ru = response.url
        self.logger.info('Hi, this is an item page! %s', ru)
        # does some regex here to extract text and saves to a file
        item = myItemClass()
        # populates the field in my item. I don't know if I really need to do this.
        return item

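I have looked into process_links and process_request, but neither seems to have access to the actual response.body. I have considered issuing a new request inside my process_links function (example below), but this runs extremely slowly and I'm not sure it works as intended. It fetches about 1 URL per minute instead of 10+ per second.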

import urllib2

def link_filtering(self, links):
    ret = []
    for link in links:
        # process_links receives Link objects, so the URL string is link.url
        req = urllib2.Request(link.url)
        response = urllib2.urlopen(req)
        the_page = response.read()
        # MyFunction is some custom function that always returns true or false on a given text input
        if MyFunction(the_page):
            ret.append(link)
    return ret

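As an example, let's pretend MyFunction checks whether the substring 'asdf' is in the string. I want to get all the URLs on my start page, and after that only follow the URLs found on pages whose response.body contains 'asdf'.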
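The behaviour described above can be sketched independently of Scrapy, which makes the intended control flow explicit: each page is downloaded once, the check runs on the body that was just downloaded, and links are extracted only when the check passes. This is a minimal sketch, not Scrapy code. The names `crawl`, `fetch`, `my_function` and `HREF_RE` are hypothetical, and the regex-based link extraction is a deliberate simplification. In Scrapy the equivalent decision would presumably sit in the callback that receives the response (for example a plain `scrapy.Spider` callback that yields `scrapy.Request` objects only when the check passes), since `response.body` is already available there and no second, blocking download is needed.

```python
import re
from collections import deque

# Simplified link extraction, for illustration only (assumption: double-quoted hrefs).
HREF_RE = re.compile(r'href="([^"]*)"')


def my_function(text):
    """Hypothetical stand-in for MyFunction: True if 'asdf' occurs in the text."""
    return 'asdf' in text


def crawl(start_url, fetch, predicate):
    """Breadth-first crawl that follows links only from pages whose body passes `predicate`.

    `fetch` is any callable mapping a URL to the page body (an assumption of this
    sketch). All links on the start page are followed unconditionally.
    Returns the list of visited URLs in visiting order.
    """
    seen = {start_url}
    queue = deque([(start_url, True)])  # (url, is_start_page)
    visited = []
    while queue:
        url, is_start = queue.popleft()
        body = fetch(url)  # each page is downloaded exactly once
        visited.append(url)
        # Decide AFTER the download whether this page's links should be followed.
        if not (is_start or predicate(body)):
            continue
        for link in HREF_RE.findall(body):
            if link not in seen:
                seen.add(link)
                queue.append((link, False))
    return visited
```

With an in-memory "site" where only `/a` contains 'asdf', calling `crawl('/start', site.__getitem__, my_function)` visits `/start`, `/a`, `/b` and the page linked from `/a`, but never the page linked only from `/b`.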

Use Scrapy to follow only URLs whose response.body matches an arbitrary condition

0 answers: