Question

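I'm trying to scrape store location data from the Subway UK Restaurant Finder using Python and Scrapy. I've managed to scrape a single page, but I'd like to set it up to run recursively through a list of 1000 IDs appended to the end of the link. Any help would be much appreciated.

Disclaimer: I don't know what I'm doing.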

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from subway.items import SubwayFinder


class MySpider(BaseSpider):
    name = "subway"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["subway.co.uk"]
    start_urls = ["http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='mid']")
        items = []
        # use a distinct loop variable so the list being iterated is not shadowed
        for title in titles:
            item = SubwayFinder()
            item["title"] = title.select("p/span/text()").extract()
            items.append(item)
        return items

Answer 1

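As your code shows, a spider callback can return (or yield) items, but it can also return or yield Requests. Scrapy sends the items to the configured pipelines and schedules the requests for further scraping. Take a look at the Request fields: the callback is the function that will be called with the response.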

To scrape multiple store locations, you will have to find either a URL pattern or an index page that links to all the stores.

For example:

http://www.subway.co.uk/business/storefinder/store-detail.aspx?id=453056039

does not look suitable for looping over every store ID; making 453056039 HTTP requests is probably not a good idea.

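I couldn't find an index page on the site. The closest thing may be to set start_urls to a list of search URLs, 'http://www.subway.co.uk/business/storefinder/search.aspx?pc=' followed by each value in range(1, 10) or some other set of values that proves to work better, and then follow the links shown on each result page. Note also that, fortunately, Scrapy will not scrape the same page twice (unless told to), so a store detail page that appears on several index pages is not a problem.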
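Concatenating a string with range() directly raises a TypeError in Python, so the list has to be built one URL at a time. A small sketch, where build_start_urls is a hypothetical helper name and range(1, 10) is only the placeholder range mentioned above, not a set of values known to work:

```python
# Search URL from this answer; which pc values return results is not verified.
SEARCH_URL = "http://www.subway.co.uk/business/storefinder/search.aspx?pc={}"


def build_start_urls(values):
    """Return one search URL per value, in the order given."""
    return [SEARCH_URL.format(value) for value in values]


# Placeholder range; swap in whatever values turn out to work better.
start_urls = build_start_urls(range(1, 10))
```

The resulting list can be assigned to the spider's start_urls attribute as it is.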

Answer 2

You can use CrawlSpider instead of BaseSpider.

See this link on using crawl spiders.

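You need to define rules that tell Scrapy how to navigate the site. These rules determine which sites and links Scrapy is allowed to crawl.

You can look at this example for the structure of a sample crawl spider.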

By the way, consider changing the function name. From the docs:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Scraping recursive page data with Scrapy
