Question

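I am trying to scrape a website with Python Scrapy. When I run them from the scrapy shell, the XPath expressions give the desired output, but they do not when run from the spider. No error is returned, only DEBUG: Crawled (200). Here is my code: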

import scrapy
import logging
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class amazon(scrapy.Spider):
    name = "automate"
    start_urls = ['http://www.geeksforgeeks.org/']

    def parse(self, response):
        for href in response.xpath('//div/a[contains(@class,"tag-link-1942 tag-link-position-3")]/@href'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item2)

    def parse_item2(self, response):
        for url in response.xpath('//div/article/header/h2/a/@href'):
            yield
            {
                'link': url.extract(),
            }
        next_page_url = response.xpath('//div[contains(@class, "wp-pagenavi")]/a[contains(@class, "page larger")]/@href')
        if next_page_url is not None:
            yield
            {
                scrapy.Request(next_page_url.extract_first(), callback=self.parse_item2)
            }

Answer 1

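The indentation in your script is a bit confusing. If I am interpreting it correctly, I find that it indeed produces no output. The following code works for me and prints the article titles; perhaps it will help you: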

import scrapy
import logging
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class amazon(scrapy.Spider):
    name = "automate"
    start_urls = ['http://www.geeksforgeeks.org/']

    def parse(self, response):
        # Follow the tag link found on the start page.
        for href in response.xpath('//div/a[contains(@class,"tag-link-1942 tag-link-position-3")]/@href'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item2)

    def parse_item2(self, response):
        for url in response.xpath('//div/article/header/h2/a/@href'):
            next_page_url = response.xpath('//div[contains(@class, "wp-pagenavi")]/a[contains(@class, "page larger")]/@href')
            # xpath() returns a list-like object, so check that it is non-empty.
            if len(next_page_url):
                # string() takes the text of the first matching entry title.
                print(response.xpath('string(//h2[@class="entry-title"]/a)').extract())
                # The request sits on the same line as yield, so it is what gets yielded.
                yield scrapy.Request(next_page_url.extract_first(), callback=self.parse_item2)
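
A likely reason the spider in the question produces nothing: a `yield` on a line by itself yields `None`, and the `{...}` on the following lines is a separate expression that is evaluated and thrown away. As far as I can tell, Scrapy ignores `None` results from a callback without logging an error, which would match the "no error, just Crawled (200)" symptom. The sketch below reproduces this in plain Python without Scrapy; the function names and the sample hrefs are made up for illustration, and the empty list is only a stand-in for an XPath query that matched nothing.

```python
def bare_yield(links):
    """Mimics the question's parse_item2: `yield` alone, dict on the next line."""
    for link in links:
        yield
        {
            'link': link,
        }  # a separate statement: built, then discarded


def yield_dict(links):
    """The dict opens on the same line as `yield`, so it is the yielded value."""
    for link in links:
        yield {
            'link': link,
        }


links = ['/a', '/b']  # hypothetical hrefs, not taken from the real site

bare_items = list(bare_yield(links))
fixed_items = list(yield_dict(links))
print(bare_items)   # [None, None]
print(fixed_items)  # [{'link': '/a'}, {'link': '/b'}]

# Stand-in for an XPath query with no matches: a query result is a list-like
# object, never None, so `is not None` is always true. Test truthiness or len().
no_matches = []
print(no_matches is not None)  # True
print(bool(no_matches))        # False
```

In the spider itself, the equivalent change would be to write `yield {'link': url.extract()}` and `yield scrapy.Request(...)` with the value starting on the same line as `yield`, and to guard the next-page request with `if next_page_url:` or `if len(next_page_url):` as in the code above.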
