Question

I created a Scrapy spider, but when I run the command

scrapy crawl scrapytest -o output.json

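it runs, but the output file comes out blank. I know the XPath is correct, so I'm not sure what is wrong. I'm still really new to this. Thanks for your help.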

import scrapy

class TestspiderSpider(scrapy.Spider):

    name = 'testspider'
    allowed_domains = ['bing.com']
    start_urls = ['http://www.bing.com/']
    url = [
            'https://www.bing.com/search?q=sample+search&FORM=AWRE'
          ]
    def parse(self, response):
        response.xpath('//*[@class="b_algo"]/h2/a/text()').extract()
        yield scrapy.Request(url = url, callback = self.parse)

Answer 1

Bing can tell that you are not using a regular browser, so try sending browser-like headers.

Try the following settings in settings.py (Scrapy only):

USER_AGENT  = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) ChromePlus/4.0.222.3 Chrome/4.0.222.3 Safari/532.2'

DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'MUID=; SRCHD=AF=NOFORM; SRCHUID=1; SRCHUSR=; _EDGE_S=SID=; MUIDB=; _SS=SID=; ipv6=; SRCHHPGUSR=;',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

Answer 2

Your code does not yield any data.

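You need to yield dictionaries, or instances of a subclass of Scrapy's Item class, that contain the extracted data in order for that data to reach the output file.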

See the corresponding section from the Scrapy tutorial.

Scrapy output file is empty
