XPath expression to find links gives "TypeError: Request url must be str or unicode, got NoneType"

Date: 2019-11-27 15:44:56

Tags: python xpath scrapy

I am trying to scrape http://www.lawncaredirectory.com/findlandscaper.htm with scrapy, but I keep getting this error:

    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType

I have tried looking for similar questions, but found no answer as to why scrapy gives me this error.

Here is my spider:

from scrapy import Spider
from lawn.items import LawnItem
import scrapy
import re 

class LawnSpider(Spider):
    name = "lawn"
    allowed_domains = ['www.lawncaredirectory.com']
    # Defining the list of pages to scrape
    start_urls = ["http://www.lawncaredirectory.com/findlandscaper.htm"] 

    def parse(self, response):
        # Defining rows to be scraped
        rows = response.xpath('//ul[@id="horizontal-list"]')
        for row in rows:
            #getting the link to each state
            state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()

            item = LawnItem()
            item['state'] = state

            #Following the link  
            yield scrapy.Request(state,
                                 callback=self.parse_detail,
                                 meta={'item': item})
    # Getting detail inside each link
    def parse_detail(self, response):
        item = response.meta['item']

        name = response.xpath('.//*[@id="container"]/div[3]/div/div/div/h2/u/text()').extract_first()
        item['name'] = name
        yield item

1 Answer:

Answer 0 (score: 1):

You don't check whether your `row.xpath()` call actually produced a result:

state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()

`state` is `None`, so you get that exception.
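For reference, the check that produces this message can be sketched like this (a simplified stand-in mirroring the error text in the traceback above, not Scrapy's actual source):

```python
# Simplified stand-in for the type check Scrapy performs on Request
# URLs; any non-string (including None) triggers the TypeError.
def check_url(url):
    if not isinstance(url, str):
        raise TypeError('Request url must be str or unicode, got %s:'
                        % type(url).__name__)
    return url

check_url('http://www.lawncaredirectory.com/findlandscaper.htm')  # fine

try:
    check_url(None)
except TypeError as exc:
    print(exc)  # → Request url must be str or unicode, got NoneType:
```

So the exception is raised the moment `scrapy.Request` is constructed, before any request is sent.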

You will always get `None` here, because there is no nested tag with the id `horizontal-list` inside the `<ul id="horizontal-list">` tag. The `.//` expression can only find descendants of the `<ul>` tag, never the tag itself!
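You can verify this descendant-only behaviour of `.//` even outside Scrapy. A minimal sketch with the standard library, using a tiny stand-in for the real page:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the real page: a single <ul id="horizontal-list">.
ul = ET.fromstring(
    '<ul id="horizontal-list">'
    '<li><a href="statedirectory.htm">Alabama</a></li>'
    '</ul>'
)

# Searching for the id *inside* the <ul> matches nothing, because .//
# only looks at descendants, never at the context node itself:
print(ul.findall(".//*[@id='horizontal-list']"))  # → []

# A relative path into the children does work:
print(ul.find('.//li/a').get('href'))  # → statedirectory.htm
```

The same distinction applies to Scrapy's XPath selectors: a relative expression starting with `.//` is evaluated against the children of the selected node.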

At best you could use `row.xpath('.//li[1]/a/@href')` to get the nested `<a href>`, but you would still get `None` if there are no `<li>` tags, if the first `<li>` tag has no directly nested `<a>` tag, or if that tag has no `href` attribute.

Next, there is only a single `<ul id="horizontal-list">` tag, so your `for row in rows:` loop will only execute once.

If you want to find all the links under the `<ul id="horizontal-list">`, select those links directly:

    # find all <a href> elements inside <ul id="horizontal-list"><li> elements
    # and take the href values.
    links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
    for link in links:
        item = LawnItem()
        item['state'] = link.get()
        yield scrapy.Request(
            link.get(),
            callback=self.parse_detail,
            meta={'item': item}
        )

Remember that you can always use `scrapy shell <url>` to try out expressions; scrapy loads the URL given on the command line for you and gives you a `response` object (among others):

    $ bin/scrapy shell --nolog http://www.lawncaredirectory.com/findlandscaper.htm
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x10eaab7c0>
    [s]   item       {}
    [s]   request    <GET http://www.lawncaredirectory.com/findlandscaper.htm>
    [s]   response   <200 http://www.lawncaredirectory.com/findlandscaper.htm>
    [s]   settings   <scrapy.settings.Settings object at 0x10eaab4c0>
    [s]   spider     <DefaultSpider 'default' at 0x10ee4de50>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    >>> links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
    >>> len(links)
    50
    >>> links[0]
    <Selector xpath='//ul[@id="horizontal-list"]/li//a/@href' data='http://www.lawncaredirectory.com/statedi'>
    >>> links[0].get()
    'http://www.lawncaredirectory.com/statedirectory.php?state=Alabama'
    >>> links[-1].get()
    'http://www.lawncaredirectory.com/statedirectory.php?state=Wyoming'

Compare this with your own expression:

    >>> rows = response.xpath('//ul[@id="horizontal-list"]')
    >>> len(rows)
    1
    >>> rows[0]
    <Selector xpath='//ul[@id="horizontal-list"]' data='<ul id="horizontal-list">\n\t\t\n<li><a href'>
    >>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href')
    []
    >>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first() is None
    True

The result is empty, so `.extract_first()` gives you `None`, because there is nothing to extract. You can't find an element again among its own descendants; `.//*[@id="horizontal-list"]` searches below the current element, while `'.'` refers to the "current" element itself. But either way, using `'.'` would only ever get you that single element.
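More generally, whenever `.extract_first()` may come back empty, a guard before building the request avoids this `TypeError` entirely. A minimal sketch (a pure-Python stand-in with hypothetical names, so it runs without a crawl):

```python
def usable_urls(hrefs):
    """Yield only real URL strings, skipping failed matches (None)."""
    for href in hrefs:
        if href is None:      # .extract_first() matched nothing
            continue          # skip it instead of crashing the spider
        yield href

# Simulated .extract_first() results: two hits and one miss.
print(list(usable_urls(['http://example.com/a', None, 'http://example.com/b'])))
# → ['http://example.com/a', 'http://example.com/b']
```

In a spider, the same `if href is None: continue` check inside the parse loop keeps a single malformed row from aborting the whole crawl.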