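I am trying to restrict Scrapy to a specific XPath location when following links. The XPath is correct (according to the XPath Helper plugin for Chrome), but when I run my CrawlSpider I get a syntax error on my rule.
My spider code is: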
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]
    rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
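So inside the rule, the XPath '//a[starts-with(@title,"Next ")]' returns an error, and I don't know why, because the XPath itself is valid. I just want the spider to follow every "Next page" link. Can anyone help? If you need any other part of my code to help, please let me know.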
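Answer 0 (score: 1)
The problem is not the XPath; the syntax of the rule as a whole is incorrect. In the question, restrict_xpaths is passed to Rule instead of SgmlLinkExtractor, and the parenthesis opened by Rule( is never closed. The following rule fixes the syntax error, but you should check that it does what you want: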
rules = (
    Rule(SgmlLinkExtractor(allow=['/f126/index*'],
                           restrict_xpaths=('//a[starts-with(@title,"Next ")]')),
         callback='parse_item', follow=True),
)
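One thing worth checking in that rule: the entries in allow are regular expressions, not shell globs, so the `*` in '/f126/index*' means "zero or more x", not "anything after index". Here is a minimal sketch using only the standard re module. The sample URLs and the stricter pattern are hypothetical, and I am assuming the extractor applies the pattern with search semantics:

```python
import re

# The allow pattern from the rule, compiled as a regular expression.
# 'index*' means "inde" followed by zero or more "x" characters.
allow_pattern = re.compile(r'/f126/index*')

# Hypothetical URLs for illustration only; the real forum URLs may differ.
sample_urls = [
    "http://www.talkbass.com/forum/f126/index2.html",
    "http://www.talkbass.com/forum/f126/inde",
    "http://www.talkbass.com/forum/f126/",
]

def allowed(url, pattern=allow_pattern):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None

results = [allowed(url) for url in sample_urls]

# A stricter, hypothetical alternative that requires a page number.
strict_pattern = re.compile(r'/f126/index\d+')
strict_results = [allowed(url, strict_pattern) for url in sample_urls]

if __name__ == "__main__":
    for url, loose, strict in zip(sample_urls, results, strict_results):
        print(url, loose, strict)
```

If the loose pattern lets through more than you intended, tighten it along the lines of the second pattern once you know what the pagination URLs really look like.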
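In general, it is strongly recommended to post the actual error in the question, because the perceived error and the actual error can differ.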
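If you want to see the actual error without running the whole crawl, you can ask Python to parse just the suspect line. A rough sketch with the standard ast module follows; the helper name syntax_error_of is made up for this example. Parsing never runs the code, so Scrapy does not need to be importable:

```python
import ast

# The rules line from the question, copied verbatim.
original_rule = '''rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]'''

# The corrected rule, with restrict_xpaths moved into the link extractor
# and every opening parenthesis closed.
fixed_rule = '''rules = (Rule(SgmlLinkExtractor(allow=['/f126/index*'], restrict_xpaths=('//a[starts-with(@title,"Next ")]')), callback='parse_item', follow=True),)'''

def syntax_error_of(source):
    """Return the SyntaxError message for source, or None if it parses.

    Hypothetical helper for this example; it only parses, it never
    executes the code, so undefined names such as Rule do not matter.
    """
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return exc.msg
    return None

if __name__ == "__main__":
    print("original:", syntax_error_of(original_rule))
    print("fixed:   ", syntax_error_of(fixed_rule))
```

The exact wording of the message depends on your Python version, but it is the text worth pasting into a question.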