Question

I am using Scrapy 1.5.1 with Python 2.7.6. I am trying to scrape usernames from the following page.

I have implemented the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class BtctalkspiderSpider(scrapy.Spider):
    name = 'btctalkSpider'
    allowed_domains = ['bitcointalk.org']
    max_uid = 10

    def parse(self, response):
        urls = response.xpath("//a/@href").extract()
        for i in range(self.max_uid):
            # scrapy shell "https://bitcointalk.org/index.php?action=profile;u=1"
            yield Request('https://bitcointalk.org/index.php?action=profile;u=%d' % i, callback=self.parse_application)

    def parse_application(self, response):
        userName = response.xpath('//td[normalize-space(.)="Name:"]/following-sibling::td/text()').extract()


        yield {
            'userName': userName
        }

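However, when I try to crawl the site, I get [] back.

I checked the XPath in the shell and everything seems fine.

Any suggestions as to what I am doing wrong?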

Answer 1

Some of the profile URLs do not exist at all, so the XPath expression evaluates to nothing.

For example: https://bitcointalk.org/index.php?action=profile;u=2

But you also need to specify a start URL, e.g. start_urls = ['https://bitcointalk.org'], or just add a start_requests function.

Here is a quotation from the Scrapy documentation about start_urls:

"Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs."

