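I have been trying to scrape recipe titles from Food Network, and I want to recursively follow the link to the next page. I am using Python 3, so some Scrapy features are not available to me, but this is what I have so far: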
import scrapy
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from testspider.items import testspiderItem
from lxml import html
class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["foodnetwork.com"]
    start_urls = ["http://www.foodnetwork.com/recipes/aarti-sequeira/middle-eastern-fire-roasted-eggplant-dip-babaganoush-recipe.html"]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="recipe-next"]/a/@href',)), callback="parse_page", follow=True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//h1[@itemprop="name"]/text()')
        for title in titles:
            item = testspiderItem()
            item["title"] = title
            yield item
The relevant markup in the page source is:
<div class="recipe-next">
<a href="/recipes/food-network-kitchens/middle-eastern-eggplant-rounds-recipe.html">Next Recipe</a>
</div>
Any help would be appreciated!
Answer 0 (score: 0)
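CrawlSpider uses the parse method itself, so when you override it, things stop working as expected (see the docs). Quoting the docs:
"When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."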
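Also, your snippet does not show where the parse_page() method comes from: the rule names it as its callback, but the spider never defines it.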