I've built a scraper for this page (with help from Stack Overflow), but the results come back empty. A single-page spider works and scrapes all the required items, but the crawling spider that follows to the next pages does not. I can't figure out what's wrong here.
Here is the scraper:
from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from mymobile.items import MymobileItem

class MmobySpider(CrawlSpider):
    name = "mmoby2"
    allowed_domains = "http://mymobile.ge"
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69"
    ]

    rule = (Rule(SgmlLinkExtractor(allow=("new/v2.php?cat=69&pnum=\d*", )),
                 callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = sel.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new", url[0])
            items.append(item)
        return items
Answer (score: 1)
Two main problems:
The attribute is named rules, not rule:
rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php?cat=69&pnum=\d*", )),
callback="parse_items",
follow=True), )
allowed_domains should be a list:
allowed_domains = ["mymobile.ge"]
Also, you need to adjust your regular expressions, as Paul suggested in the comments.
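Putting both fixes together: the allow patterns passed to SgmlLinkExtractor are regular expressions, so the literal "." and "?" in the URL act as metacharacters unless escaped. That is most likely the adjustment Paul had in mind, since the unescaped pattern never matches the real pagination URLs (the "?" makes the preceding "p" optional instead of matching the query-string separator). A minimal sketch, assuming escaping is the only regex change needed:

    allowed_domains = ["mymobile.ge"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"new/v2\.php\?cat=69&pnum=\d+",)),
             # "." and "?" are escaped so they match literally;
             # \d+ (rather than \d*) requires at least one digit in pnum
             callback="parse_items",
             follow=True),
    )

With these attributes in place, the link extractor should actually pick up the pnum pagination links and hand each page to parse_items.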