XPath expression to find links gives "TypeError: Request url must be str or unicode, got NoneType"

Date: 2019-11-27 15:44:56

Tags: python xpath scrapy

I am trying to scrape http://www.lawncaredirectory.com/findlandscaper.htm with scrapy, but I keep getting this error:

    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType

I have tried looking for similar questions, but found no answer as to why scrapy gives me this error.

Here is my spider:

from scrapy import Spider
from lawn.items import LawnItem
import scrapy
import re 

class LawnSpider(Spider):
    name = "lawn"
    allowed_domains = ['www.lawncaredirectory.com']
    # Defining the list of pages to scrape
    start_urls = ["http://www.lawncaredirectory.com/findlandscaper.htm"] 

    def parse(self, response):
        # Defining rows to be scraped
        rows = response.xpath('//ul[@id="horizontal-list"]')
        for row in rows:
            #getting the link to each state
            state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()

            item = LawnItem()
            item['state'] = state

            #Following the link  
            yield scrapy.Request(state,
                                 callback=self.parse_detail,
                                 meta={'item': item})
    # Getting detail inside each link
    def parse_detail(self, response):
        item = response.meta['item']

        name = response.xpath('.//*[@id="container"]/div[3]/div/div/div/h2/u/text()').extract_first()
        item['name'] = name
        yield item

1 Answer:

Answer 0 (score: 1):

You don't check whether your `row.xpath()` call actually produced a result:

state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()

`state` is `None`, so you get that exception.
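For reference, the check that produces this message can be sketched like this (a simplified stand-in mirroring the error text in the traceback above, not Scrapy's actual source):

```python
# Simplified stand-in for the type check Scrapy performs on Request
# URLs; any non-string (including None) triggers the TypeError.
def check_url(url):
    if not isinstance(url, str):
        raise TypeError('Request url must be str or unicode, got %s:'
                        % type(url).__name__)
    return url

check_url('http://www.lawncaredirectory.com/findlandscaper.htm')  # fine

try:
    check_url(None)
except TypeError as exc:
    print(exc)  # → Request url must be str or unicode, got NoneType:
```

So the exception is raised the moment `scrapy.Request` is constructed, before any request is sent.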

You will always get `None` here, because there is no nested tag with the id `horizontal-list` inside the `<ul id="horizontal-list">` tag. The `.//` expression can only find descendants of the `<ul>` tag, never the tag itself!
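You can verify this descendant-only behaviour of `.//` even outside Scrapy. A minimal sketch with the standard library, using a tiny stand-in for the real page:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the real page: a single <ul id="horizontal-list">.
ul = ET.fromstring(
    '<ul id="horizontal-list">'
    '<li><a href="statedirectory.htm">Alabama</a></li>'
    '</ul>'
)

# Searching for the id *inside* the <ul> matches nothing, because .//
# only looks at descendants, never at the context node itself:
print(ul.findall(".//*[@id='horizontal-list']"))  # → []

# A relative path into the children does work:
print(ul.find('.//li/a').get('href'))  # → statedirectory.htm
```

The same distinction applies to Scrapy's XPath selectors: a relative expression starting with `.//` is evaluated against the children of the selected node.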

At best you could use `row.xpath('.//li[1]/a/@href')` to get the nested `<a href>`, but you would still get `None` if there are no `<li>` tags, if the first `<li>` tag has no directly nested `<a>` tag, or if that tag has no `href` attribute.

Next, there is only a single `<ul id="horizontal-list">` tag, so your `for row in rows:` loop will only execute once.

If you want to find all the links under the `<ul id="horizontal-list">`, select those links directly:

    # find all <a href> elements inside <ul id="horizontal-list"><li> elements
    # and take the href values.
    links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
    for link in links:
        item = LawnItem()
        item['state'] = link.get()
        yield scrapy.Request(
            link.get(),
            callback=self.parse_detail,
            meta={'item': item}
        )

Remember that you can always use `scrapy shell <url>` to try out expressions; scrapy loads the URL given on the command line for you and gives you a `response` object (among others):

    $ bin/scrapy shell --nolog http://www.lawncaredirectory.com/findlandscaper.htm
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x10eaab7c0>
    [s]   item       {}
    [s]   request    <GET http://www.lawncaredirectory.com/findlandscaper.htm>
    [s]   response   <200 http://www.lawncaredirectory.com/findlandscaper.htm>
    [s]   settings   <scrapy.settings.Settings object at 0x10eaab4c0>
    [s]   spider     <DefaultSpider 'default' at 0x10ee4de50>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    >>> links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
    >>> len(links)
    50
    >>> links[0]
    <Selector xpath='//ul[@id="horizontal-list"]/li//a/@href' data='http://www.lawncaredirectory.com/statedi'>
    >>> links[0].get()
    'http://www.lawncaredirectory.com/statedirectory.php?state=Alabama'
    >>> links[-1].get()
    'http://www.lawncaredirectory.com/statedirectory.php?state=Wyoming'

Compare this with your own expression:

    >>> rows = response.xpath('//ul[@id="horizontal-list"]')
    >>> len(rows)
    1
    >>> rows[0]
    <Selector xpath='//ul[@id="horizontal-list"]' data='<ul id="horizontal-list">\n\t\t\n<li><a href'>
    >>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href')
    []
    >>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first() is None
    True

The result is empty, so `.extract_first()` gives you `None`, because there is nothing to extract. You can't find an element again among its own descendants; `.//*[@id="horizontal-list"]` searches below the current element, while `'.'` refers to the "current" element itself. But either way, using `'.'` would only ever get you that single element.
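More generally, whenever `.extract_first()` may come back empty, a guard before building the request avoids this `TypeError` entirely. A minimal sketch (a pure-Python stand-in with hypothetical names, so it runs without a crawl):

```python
def usable_urls(hrefs):
    """Yield only real URL strings, skipping failed matches (None)."""
    for href in hrefs:
        if href is None:      # .extract_first() matched nothing
            continue          # skip it instead of crashing the spider
        yield href

# Simulated .extract_first() results: two hits and one miss.
print(list(usable_urls(['http://example.com/a', None, 'http://example.com/b'])))
# → ['http://example.com/a', 'http://example.com/b']
```

In a spider, the same `if href is None: continue` check inside the parse loop keeps a single malformed row from aborting the whole crawl.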