How to crawl a site with Scrapy 0.24 and only parse pages whose URLs match a regex

Date: 2015-05-04 17:49:59

Tags: python regex scrapy

I'm using Scrapy 0.24 with Python 2.7.9 on a 64-bit Windows machine. I'm trying to tell Scrapy to start at the URL http://www.allen-heath.com/products/ and, from there, collect data only from pages whose URLs contain the string ahproducts.

Unfortunately, when I run this, no data is scraped at all. What am I doing wrong? My code is below. If there is any more information I can provide to help with an answer, please ask and I'll edit the question.

Here is a pastebin of my crawler's log: http://pastebin.com/C2QC23m3

Thank you.

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/products/"
    ]
    rules = [Rule(LinkExtractor(allow=['ahproducts']), 'parse')]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

After some suggestions from eLRuLL, here is my updated spider file. I've changed start_urls to a page that links to URLs containing "ahproducts"; my original start page didn't have any matching URLs on it.

products.py

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.contrib.spiders.CrawlSpider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/key-series/ilive-series/ilive-remote-controllers/"
    ]
    rules = (
            Rule(
                LinkExtractor(allow='.*ahproducts.*'),
                callback='parse_item'
                ),
            )

    def parse_item(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

1 Answer:

Answer 0 (score: 2):

First of all, to use rules you need to subclass scrapy.contrib.spiders.CrawlSpider instead of scrapy.Spider.
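For example, the class declaration would change roughly like this (a minimal sketch, keeping the Scrapy 0.24 contrib import path used in the question):

from scrapy.contrib.spiders import CrawlSpider

class productsSpider(CrawlSpider):  # was: class productsSpider(scrapy.Spider)
    name = "products"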

Then change your callback method's name to parse_item instead of parse, and update your rules like this:

rules = (
    Rule(
        LinkExtractor(allow='.*ahproducts.*'),
        callback='parse_item'
    ),
)

Remember that the parse method is always called with the responses from the start_urls requests, so CrawlSpider reserves it for its own rule-handling logic and your callback must use a different name.

Finally, change allowed_domains to just allowed_domains = ["allen-heath.com"] (a domain, not a full URL).

PS. To crawl different levels of a site with rules, you need to specify which links to follow and which links to parse, like this:

rules = (
    Rule(
        LinkExtractor(
            allow=('some link to follow')
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=('some link to parse')
        ),
        callback='parse_method',
    ),
)
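Putting all of the above together, a minimal sketch of the corrected spider could look like the following. The selectors and item fields are copied from the question and assumed unchanged; the 'series' pattern in the follow rule is only a guess based on the URLs mentioned in the question and may need adjusting to the site's real link structure.

import urlparse

from allenheath.items import ProductItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class productsSpider(CrawlSpider):
    name = "products"
    allowed_domains = ["allen-heath.com"]  # domain only, no scheme or path
    start_urls = [
        "http://www.allen-heath.com/products/"
    ]

    rules = (
        # follow intermediate series/category pages without parsing them
        # (the 'series' pattern is an assumption based on the question's URLs)
        Rule(LinkExtractor(allow=('series',)), follow=True),
        # parse only pages whose URL contains "ahproducts"
        Rule(LinkExtractor(allow=('ahproducts',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = ProductItem()
        item['model'] = response.css('#prodsingleouter > div > div > h2::text').extract()
        item['itemcode'] = response.css('#prodsingleouter > div > div > h2::text').extract()
        item['shortdesc'] = response.css('#prodsingleouter > div > div > h3::text').extract()
        item['desc'] = response.css('#tab1 #productcontent').extract()
        item['series'] = response.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
        item['imageorig'] = response.css('#prodsingleouter > div > div > h2::text').extract()
        # resolve relative image paths against the page URL
        image_urls = response.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
        item['image_urls'] = [urlparse.urljoin(response.url, url) for url in image_urls]
        yield item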