I'm using Scrapy 0.24 with Python 2.7.9 on a 64-bit Windows machine. I'm trying to tell Scrapy to start at the URL http://www.allen-heath.com/products/ and, from there, collect data only from pages whose URL contains the string ahproducts.
Unfortunately, when I do this, no data is scraped at all. What am I doing wrong? My code is below. If I can provide more information to help with an answer, please ask and I'll edit it in.
Here is a pastebin of my crawler log: http://pastebin.com/C2QC23m3.
Thanks.
import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/products/"
    ]

    rules = [Rule(LinkExtractor(allow=['ahproducts']), 'parse')]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item
After some suggestions from eLRuLL, here is my updated spider file. I've changed the start_url to a page whose links contain "ahproducts" in their URLs; my original starting page had no matching URLs on it.
products.py
import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

class productsSpider(scrapy.contrib.spiders.CrawlSpider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/key-series/ilive-series/ilive-remote-controllers/"
    ]

    rules = (
        Rule(
            LinkExtractor(allow='.*ahproducts.*'),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['itemcode'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item
Answer 0 (score: 2)
First, to use rules you need to use scrapy.contrib.spiders.CrawlSpider instead of scrapy.Spider.
Then change your method name to parse_item instead of parse, and update your rules, like:
rules = (
    Rule(
        LinkExtractor(allow='.*ahproducts.*'),
        callback='parse_item'
    ),
)
The parse method is always called on the responses to the start_urls requests; CrawlSpider uses it internally to apply the rules, so your callback needs a different name.
Finally, change allowed_domains to allowed_domains = ["allen-heath.com"] (with the scheme and trailing slash included, the offsite filter rejects every followed link).
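Putting those changes together, a minimal sketch of the corrected spider could look like the following. The item fields and CSS selectors are copied from the question (only a few fields shown), and ProductItem is assumed to be the project's existing item class:

import urlparse

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from allenheath.items import ProductItem  # existing item class from the question's project


class ProductsSpider(CrawlSpider):
    name = "products"
    allowed_domains = ["allen-heath.com"]  # domain only: no scheme, no trailing slash
    start_urls = ["http://www.allen-heath.com/products/"]

    rules = (
        # Hand every followed page whose URL contains "ahproducts" to parse_item.
        Rule(LinkExtractor(allow='ahproducts'), callback='parse_item'),
    )

    def parse_item(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            # Make relative image URLs absolute before they reach the images pipeline.
            item['image_urls'] = [
                urlparse.urljoin(response.url, url)
                for url in sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            ]
            yield item

Note that a Rule with a callback does not follow links by default, so to start at /products/ and reach product pages through the category pages you also need a follow rule, as in the P.S. below.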
P.S. To crawl different levels of a site with rules, you need to specify which links to follow and which links to parse, like this:
rules = (
    Rule(
        LinkExtractor(
            allow=('some link to follow')
        ),
        follow=True,
    ),
    Rule(
        LinkExtractor(
            allow=('some link to parse')
        ),
        callback='parse_method',
    ),
)
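Applied to this site, one possible sketch of such a rule pair (the 'key-series' pattern is an assumption based on the category URLs that appear in the question):

rules = (
    # Follow category/series pages so the crawl can go deeper
    # ("key-series" is an assumed pattern taken from the URLs in the question).
    Rule(LinkExtractor(allow='key-series'), follow=True),
    # Parse only the product pages themselves.
    Rule(LinkExtractor(allow='ahproducts'), callback='parse_item'),
)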