Question

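This may be a duplicate question. I'm trying to run a Scrapy spider, but I can't get it to work. Why do I get the error message "HtmlResponse has no attribute urljoin"? And what do the Scrapy stats mean when request_count is 3 and response_count is also 3? My code is below. I'd appreciate any help with this.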

import scrapy
from scrapy.http.request import Request
from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector

class BotSpider_2(BaseSpider):
    name = 'BotSpider_2'
    name = "google.co.th"
    start_urls = ["http://www.google.co.th/"]


    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites

Answer 1

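First, your imports are incorrect. For example, why use BaseSpider instead of Spider? You are also missing the import for Selector. As for the urljoin error you describe, I don't see the code you posted raising it. urljoin is a method of the Response object, available since around Scrapy v1, that combines the current URL with a path to produce an absolute URL you can crawl.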

$ scrapy shell "https://scrapy.org"
In [1]: response.url
Out[1]: 'https://scrapy.org'

In [2]: response.urljoin('/some/cool/path')
Out[2]: 'https://scrapy.org/some/cool/path'
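
If your installed Scrapy is old enough that the response object really lacks urljoin, a rough stand-in is the standard library's urljoin applied to response.url. This is a minimal sketch, not Scrapy's own code: the helper name absolute_url is made up for illustration, and I'm assuming plain URL joining is all you need (as far as I know, Scrapy's version on HTML responses may also take the page's base URL into account).

```python
from urllib.parse import urljoin


def absolute_url(base_url, path):
    """Join a page URL with a relative or root-relative path.

    Hypothetical helper: approximates what response.urljoin(path) gives you
    when called with base_url = response.url.
    """
    return urljoin(base_url, path)


# Root-relative path, same inputs as the shell session above.
print(absolute_url("https://scrapy.org", "/some/cool/path"))

# Relative path resolved against a directory-style URL.
print(absolute_url("https://scrapy.org/docs/", "intro.html"))
```

Inside a spider callback you would call it as absolute_url(response.url, href) for each extracted href.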

I've cleaned up the imports, and your code works like a charm!

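import scrapy
from scrapy.selector import Selector


class BotSpider_2(scrapy.Spider):
    # Spider name used on the command line: scrapy crawl google.co.th
    name = "google.co.th"
    start_urls = ["http://www.google.co.th/"]

    def parse(self, response):
        # Wrap the response in a Selector and extract the page title text.
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print(sites)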
