I am trying to scrape http://www.lawncaredirectory.com/findlandscaper.htm with scrapy, but I keep getting the error
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType
I have tried searching for similar questions, but found no answer as to why scrapy gives me this error.

Here is my spider:
from scrapy import Spider
from lawn.items import LawnItem
import scrapy
import re

class LawnSpider(Spider):
    name = "lawn"
    allowed_domains = ['www.lawncaredirectory.com']
    # Defining the list of pages to scrape
    start_urls = ["http://www.lawncaredirectory.com/findlandscaper.htm"]

    def parse(self, response):
        # Defining rows to be scraped
        rows = response.xpath('//ul[@id="horizontal-list"]')
        for row in rows:
            # getting the link to each state
            state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()
            item = LawnItem()
            item['state'] = state
            # Following the link
            yield scrapy.Request(state,
                                 callback=self.parse_detail,
                                 meta={'item': item})

    # Getting detail inside each link
    def parse_detail(self, response):
        item = response.meta['item']
        name = response.xpath('.//*[@id="container"]/div[3]/div/div/div/h2/u/text()').extract_first()
Answer (score: 1):
You don't check whether your row.xpath() call actually produced a result:

state = row.xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first()

state is None, so you get that exception.
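To make the failure mode concrete, here is a minimal, self-contained sketch; make_request is a hypothetical stand-in for scrapy.Request's URL type check, not scrapy's actual code:

```python
# Hypothetical stand-in for scrapy.Request's URL type check.
def make_request(url):
    if not isinstance(url, str):
        raise TypeError('Request url must be str or unicode, got %s:'
                        % type(url).__name__)
    return url

state = None  # what extract_first() returns when the XPath matches nothing
try:
    make_request(state)
except TypeError as exc:
    print(exc)  # Request url must be str or unicode, got NoneType:
```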
You will always get None here, because there is no tag with id="horizontal-list" nested inside the <ul id="horizontal-list"> tag. The expression .//*[@id="horizontal-list"] can only find child tags of the <ul> tag, never the tag itself!

At best you could use row.xpath('.//li[1]/a/@href') to get the nested <a href> tag, but that still produces None if there are no <li> tags, if the first <li> tag has no directly nested <a> tag, or if that tag has no href attribute.
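The descendant-only behaviour of .// is easy to verify. A short sketch using the standard library's xml.etree.ElementTree (not scrapy's parsel selectors, though its subset of XPath behaves the same way here):

```python
import xml.etree.ElementTree as ET

ul = ET.fromstring(
    '<ul id="horizontal-list">'
    '<li><a href="http://example.com/Alabama">Alabama</a></li>'
    '</ul>'
)

# Searching *inside* the <ul> for its own id matches nothing:
# './/' only ever looks at descendants, never at the context node itself.
print(ul.find(".//*[@id='horizontal-list']"))  # None

# A plain descendant search does reach the nested <a> tag:
print(ul.find('.//li[1]/a').get('href'))  # http://example.com/Alabama
```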
Next, there is only a single <ul id="horizontal-list"> tag, so your for row in rows: loop will only run once.

If you wanted to find all the links under that <ul> tag, select those links directly:
# find all <a href> elements inside <ul id="horizontal-list"><li> elements
# and take the href values.
links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
for link in links:
    item = LawnItem()
    item['state'] = link.get()
    yield scrapy.Request(
        link.get(),
        callback=self.parse_detail,
        meta={'item': item}
    )
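The same "collect every href under the <ul>" idea can be sketched with only the standard library's html.parser, so it runs without scrapy installed (the class name LinkCollector is made up for this illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags inside <ul id="horizontal-list">."""
    def __init__(self):
        super().__init__()
        self.in_list = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'ul' and attrs.get('id') == 'horizontal-list':
            self.in_list = True
        elif tag == 'a' and self.in_list and 'href' in attrs:
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'ul':
            self.in_list = False

parser = LinkCollector()
parser.feed('<ul id="horizontal-list">'
            '<li><a href="/statedirectory.php?state=Alabama">Alabama</a></li>'
            '<li><a href="/statedirectory.php?state=Wyoming">Wyoming</a></li>'
            '</ul>')
print(parser.links)  # one request would then be built per collected link
```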
Remember that you can always use scrapy shell <url> to try out expressions; scrapy loads the URL given on the command line for you and gives you a response object (among others):

$ bin/scrapy shell --nolog http://www.lawncaredirectory.com/findlandscaper.htm
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10eaab7c0>
[s] item {}
[s] request <GET http://www.lawncaredirectory.com/findlandscaper.htm>
[s] response <200 http://www.lawncaredirectory.com/findlandscaper.htm>
[s] settings <scrapy.settings.Settings object at 0x10eaab4c0>
[s] spider <DefaultSpider 'default' at 0x10ee4de50>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> links = response.xpath('//ul[@id="horizontal-list"]/li//a/@href')
>>> len(links)
50
>>> links[0]
<Selector xpath='//ul[@id="horizontal-list"]/li//a/@href' data='http://www.lawncaredirectory.com/statedi'>
>>> links[0].get()
'http://www.lawncaredirectory.com/statedirectory.php?state=Alabama'
>>> links[-1].get()
'http://www.lawncaredirectory.com/statedirectory.php?state=Wyoming'
Compare this with your own expression:

>>> rows = response.xpath('//ul[@id="horizontal-list"]')
>>> len(rows)
1
>>> rows[0]
<Selector xpath='//ul[@id="horizontal-list"]' data='<ul id="horizontal-list">\n\t\t\n<li><a href'>
>>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href')
[]
>>> rows[0].xpath('.//*[@id="horizontal-list"]/li[1]/a/@href').extract_first() is None
True
You got an empty result, so .extract_first() gives you None, because .//*[@id="horizontal-list"] can't find anything: an element can never be found again as one of its own children. Use '.' to refer to the 'current' element instead. But either way, using '.' would only ever give you that one element.
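A short sketch of that last point, again with the stdlib's ElementTree rather than scrapy's selectors:

```python
import xml.etree.ElementTree as ET

ul = ET.fromstring('<ul id="horizontal-list"><li><a href="/a">A</a></li></ul>')

# '.' matches the context element itself ...
print(ul.findall('.'))  # a one-element list: the <ul> itself
# ... while a descendant search can never match the context element.
print(ul.findall(".//*[@id='horizontal-list']"))  # []
```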