Using regular expressions in a Scrapy downloader middleware

Time: 2018-11-09 07:31:20

Tags: python regex python-3.x scrapy scrapy-middleware

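I have been trying to create a custom middleware in Scrapy that uses regular expressions to flag URLs containing certain patterns. In short, there is a list of exceptions, and every URL is checked against it. However, the middleware fails to identify the exceptions correctly (re.match() always returns None).

I tried the same regular expression in a separate script, and it works there. I would really appreciate any ideas about why this happens.

Here is a sample case:

1) The spider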

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AmazonSpider(CrawlSpider):
    name = 'amazon'
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://amazon.co.uk/']

    rules = (
        Rule(LinkExtractor(allow=''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        i['url'] = response.url
        return i

2) Settings:

BOT_NAME = 'foo'

SPIDER_MODULES = ['foo.spiders']
NEWSPIDER_MODULE = 'foo.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'

ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'foo.middlewares.FooDownloaderMiddleware': 543,
    'foo.middlewares.TryMiddleware':500,
}

3) My middleware (i.e., a new class in middlewares.py):

import logging
import re

...

# Module-level logger; assumed here, the elided part above may already define it.
logger = logging.getLogger(__name__)


class TryMiddleware(object):

    def __init__(self):
        self.items_scraped = 0
        self.target = ''
        self.exceptions = []

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()

        return s

    def process_request(self, request, spider):
        self.target = str(request)

        # Just an example, at a later stage, these will be dynamically generated.
        self.exceptions = ['Audible-Audiobook-Downloads','help']

        for i in self.exceptions:
            pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(i)))

            if i in self.target:
                m = pattern.match(self.target)
                # This is how I tried checking if the word is contained in the url,
                # and see if regex caught it.
                logger.info(f'\n*\nFound {m} in {self.target}\n*\n')

        return None

4) Here is a sample of what my logger reports:

* Found None in <GET https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice?ie=UTF8&nodeId=201890250> *

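1 Answer:

Answer 0: (score: 2)

Your code works. You are trying to match Audible-Audiobook-Downloads, which does not appear in the URL you are checking, so that attempt returns None, as you saw. It then checks whether help is present in the URL, which it is, and that match is printed.

In the code below, I check that m is not None and then print the full match.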

import logging
import re

exceptions = ['Audible-Audiobook-Downloads','help']

for i in exceptions:
    pattern = re.compile(r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(i)))

    m = pattern.match("https://www.amazon.co.uk/gp/help/customer/display.html/ref=footer_cookies_notice?ie=UTF8&nodeId=201890250")
    if m:
        print(m.group(0))
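
One more thing that may be worth checking in the middleware itself: the question matches against str(request), and the logged line ends in ">", which suggests the string being matched is the request's representation (something like "<GET https://...>") rather than the bare URL. Because re.match only matches at the start of the string, and "<" is not in the pattern's character class, that would give None for every exception even when the word is present. The sketch below reproduces this with plain strings, so Scrapy is not needed to run it. The helper name find_exception and the REQUEST_REPR string are assumptions made for illustration, not part of the original code.

```python
import re

URL = ("https://www.amazon.co.uk/gp/help/customer/display.html/"
       "ref=footer_cookies_notice?ie=UTF8&nodeId=201890250")

# Assumption: this mimics what str(request) looks like for a GET request.
REQUEST_REPR = "<GET {}>".format(URL)

EXCEPTIONS = ['Audible-Audiobook-Downloads', 'help']


def find_exception(text, exceptions, anchored=True):
    """Return (word, match) for the first exception whose pattern hits, else None.

    anchored=True uses re.match (start of string only), as in the question;
    anchored=False uses re.search (anywhere in the string).
    """
    for word in exceptions:
        pattern = re.compile(
            r'[a-z0-9.:/-]+/{}/[0-9a-z.:/-]+'.format(re.escape(word)))
        m = pattern.match(text) if anchored else pattern.search(text)
        if m:
            return word, m
    return None


# Bare URL: the 'help' pattern matches from the start of the string.
print(find_exception(URL, EXCEPTIONS))
# '<GET ...>' string: re.match fails on the leading '<'.
print(find_exception(REQUEST_REPR, EXCEPTIONS))
# Same string with re.search: the URL part is found.
print(find_exception(REQUEST_REPR, EXCEPTIONS, anchored=False))
```

If this is indeed the cause, matching against request.url instead of str(request) in process_request (or switching to pattern.search) should make the middleware behave like the standalone script.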