Don't crawl URLs already saved in the database

Posted: 2013-10-16 00:01:31

Tags: scrapy

I store already-crawled URLs in a MySQL database. When Scrapy crawls the site again, the scheduler or downloader should only visit/crawl/download a page if its URL is not already in the database.

#settings.py
DOWNLOADER_MIDDLEWARES = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'myproject.middlewares.ProxyMiddleware': 410,
     'myproject.middlewares.DupFilterMiddleware': 390,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    # Disable compression middleware, so the actual HTML pages are cached
}

#middlewares.py
import MySQLdb

from scrapy import log
from scrapy.exceptions import IgnoreRequest

class DupFilterMiddleware(object):
    def process_response(self, request, response, spider):
        conn = MySQLdb.connect(user='dbuser',passwd='dbpass',db='dbname',host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        log.msg("Make mysql connection", level=log.INFO)

        cursor.execute("""SELECT id FROM scrapy WHERE url = %s""", (response.url))
        if cursor.fetchone() is None:
            return None
        else:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % response.url)

#spider.py
# Imports for the old pre-1.0 Scrapy API used below.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose
from w3lib.html import replace_escape_chars

from myproject.items import testItem  # assuming the item is defined in myproject.items

class TestSpider(CrawlSpider):
    name = "test_spider"
    allowed_domains = ["test.com"]
    start_urls = ["http://test.com/company/JV-Driver-Jobs-dHJhZGVzODkydGVhbA%3D%3D"]

    rules = [
        Rule(SgmlLinkExtractor(allow=("http://example.com/job/(.*)",)),callback="parse_items"),
        Rule(SgmlLinkExtractor(allow=("http://example.com/company/",)), follow=True),
    ]

    def parse_items(self, response):
        l = XPathItemLoader(testItem(), response = response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('job_title', '//h1/text()')
        l.add_value('url',response.url)
        l.add_xpath('job_description', '//tr[2]/td[2]')
        l.add_value('job_code', '99')
        return l.load_item()

It works, but I get an error: an "Error downloading" log entry coming from the raise IgnoreRequest(). Is that intended?

2013-10-15 17:54:16-0600 [test_spider] ERROR: Error downloading <GET http://example.com/job/aaa>: Duplicate --db-- item found: http://example.com/job/aaa

Another problem with my approach is that I have to run a query for every single URL I am about to crawl. Say I have 10k URLs to crawl; that means hitting the MySQL server 10k times. How can I do this with a single MySQL query? (e.g. fetch all crawled URLs, store them somewhere, then check request URLs against that.)

Update

Following audiodude's suggestion below, here is my latest code. However, DupFilterMiddleware stopped working: it runs __init__, but process_request is no longer called. Removing __init__ makes process_request work again. What am I doing wrong?

class DupFilterMiddleware(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='myuser',passwd='mypw',db='mydb',host='localhost', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

        self.url_set = set()
        self.cursor.execute('SELECT url FROM scrapy')
        for (url,) in self.cursor.fetchall():  # fetchall() returns rows as 1-tuples
            self.url_set.add(url)

        print self.url_set

        log.msg("DupFilterMiddleware Initialize mysql connection", level=log.INFO)

    def process_request(self, request, spider):
        log.msg("Process Request URL:{%s}" % request.url, level=log.WARNING)
        if request.url in self.url_set:
            log.msg("IgnoreRequest Exception {%s}" % request.url, level=log.WARNING)
            raise IgnoreRequest()
        else:
            return None

1 Answer:

Answer 0 (score: 4)

A few things I can think of:

First, you should use process_request in your DupFilterMiddleware. That way you filter the request before it is ever downloaded. Your current solution wastes a lot of time and resources downloading pages that end up being thrown away.
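For example, the check from the question could move into process_request roughly like this (a minimal sketch reusing the table and credentials from the question; it still reconnects for every request, which the next point addresses):

import MySQLdb
from scrapy.exceptions import IgnoreRequest

class DupFilterMiddleware(object):
    def process_request(self, request, spider):
        # Checking before the download means already-crawled URLs are never fetched.
        conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname',
                               host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        cursor.execute("SELECT id FROM scrapy WHERE url = %s", (request.url,))
        seen = cursor.fetchone()
        conn.close()
        if seen is not None:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None  # None tells Scrapy to keep handling the request normally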

Second, you should not connect to your database inside process_response/process_request. That means you create a new connection for every item (and throw away the old one), which is very inefficient. Try the following instead:

class DupFilterMiddleware(object):
  def __init__(self):
    self.conn = MySQLdb.connect(...
    self.cursor = self.conn.cursor()

Then, in your process_response method, use self.cursor.execute(... instead of creating a new connection and cursor.execute(... every time.
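Concretely, the shared connection turns the method into something like this (a sketch using the connection details from the question; note it returns the response so later middlewares and the spider still receive it):

import MySQLdb
from scrapy.exceptions import IgnoreRequest

class DupFilterMiddleware(object):
    def __init__(self):
        # One connection and cursor for the lifetime of the middleware.
        self.conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname',
                                    host='localhost', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_response(self, request, response, spider):
        # Reuse the shared cursor instead of reconnecting for every response.
        self.cursor.execute("SELECT id FROM scrapy WHERE url = %s", (response.url,))
        if self.cursor.fetchone() is None:
            return response
        raise IgnoreRequest("Duplicate --db-- item found: %s" % response.url)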

Finally, I agree that hitting the MySQL server 10k times is probably not ideal. For such a small amount of data, why not load it all into an in-memory set()? Put this in the __init__ method of your downloader middleware:

self.url_set = set()
self.cursor.execute('SELECT url FROM scrapy')
for (url,) in self.cursor.fetchall():  # each row comes back as a 1-tuple
  self.url_set.add(url)

Then, instead of running a query and checking the result, just do:

if request.url in self.url_set:  # inside process_request, per the first point
  raise IgnoreRequest(...
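
Putting all three suggestions together, the whole middleware might look roughly like this (a sketch, assuming the scrapy table from the question and that the full URL list fits comfortably in memory):

import MySQLdb
from scrapy.exceptions import IgnoreRequest

class DupFilterMiddleware(object):
    def __init__(self):
        # Load every previously crawled URL once, at startup.
        conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname',
                               host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM scrapy')
        self.url_set = set(row[0] for row in cursor.fetchall())  # rows are 1-tuples
        cursor.close()
        conn.close()

    def process_request(self, request, spider):
        # Drop requests whose URL was already crawled in an earlier run.
        if request.url in self.url_set:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None

One thing to keep in mind with this approach: the set is read once when the middleware is created, so URLs inserted into the table while the spider is running will not be filtered until the next run.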