I have the following Scrapy code, which I am using to scrape Premier League data from a website:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/2/Seasons/3853/Stages/7794/PlayerStatistics/England-Premier-League-2013-2014"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
             ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy', 'crawl', 'goal3'])
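What the code seems to be doing is taking the Premier League data link as its starting point, but then following every link that page contains, even when a link leads to a part of the site that is unrelated to the Premier League data. In effect, it ends up crawling the whole site rather than only the pages that hang off the start URL.
Is there a way to make Scrapy follow only the links that descend from the start URL?
Thanks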
Answer 0 (score: 1)
You need to configure rules so that links are extracted only for the specific tournament:
rules = [
    Rule(SgmlLinkExtractor(allow=('Regions/252/Tournaments/2', )),
         callback='parse_item',
         follow=True)
]
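To see why this narrows the crawl, here is a minimal sketch of how an allow filter behaves, written with only the standard library so it runs without Scrapy. It assumes, as the Scrapy documentation describes, that each allow entry is a regular expression searched against the absolute URL and that an empty allow accepts everything. The helper is_allowed and the sample URLs are hypothetical and only illustrate the idea; they are not part of Scrapy.

```python
import re

# Pattern taken from the rule above; each `allow` entry is treated as a regex.
ALLOW = (r'Regions/252/Tournaments/2',)


def is_allowed(url, allow=ALLOW):
    """Return True if `url` would pass an `allow` filter (hypothetical helper).

    An empty `allow` accepts every URL, which is why the original
    rules with allow=() ended up crawling the whole site.
    """
    if not allow:
        return True
    return any(re.search(pattern, url) for pattern in allow)


# Hypothetical URLs, used only for illustration.
tournament_url = "http://www.whoscored.com/Regions/252/Tournaments/2/Seasons/3853"
other_url = "http://www.whoscored.com/Regions/206/Tournaments/4/Spain-La-Liga"

print(is_allowed(tournament_url))       # True: matches the tournament pattern
print(is_allowed(other_url))            # False: filtered out
print(is_allowed(other_url, allow=()))  # True: empty allow matches everything
```

Because the pattern is searched rather than anchored, a URL containing Regions/252/Tournaments/29 would also pass. If that matters, a trailing slash in the pattern (Regions/252/Tournaments/2/) is one way to tighten it. Note also that recent Scrapy releases replace scrapy.contrib and SgmlLinkExtractor with scrapy.spiders and scrapy.linkextractors.LinkExtractor, which take the same allow argument.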