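I'm trying to create my first spider with Scrapy. I'm using Dmoz as a test, and I get this error message: TypeError: Request url must be str or unicode, got NoneType. However, in the debug output I can see the correct URL.
Code: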
import scrapy
import urlparse


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all"]

    def parse(self, response):
        sites = response.css('#site-list-content > div.site-item > div.title-and-desc')
        for site in sites:
            yield {
                'name': site.css('a > div.site-title::text').extract_first().strip(),
                'url': site.xpath('a/@href').extract_first().strip(),
                'description': site.css('div.site-descr::text').extract_first().strip(),
            }
        nxt = response.css('#subcategories-div > div.previous-next > div.next-page')
        next_page = nxt.css('a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Log:
2016-10-18 11:17:03 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
2016-10-18 11:17:03 [scrapy] ERROR: Spider error processing <GET http://www.dmoz.org/search?q=france&start=20&type=next&all=no&t=regional&cat=all> (referer: http://www.dmoz.org/search?q=france&all=no&t=regional&cat=all)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/ENV/bin/tutorial/dirbot/spiders/dmoz.py", line 25, in parse
    yield scrapy.Request(next_page, callback=self.parse)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 51, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2016-10-18 11:17:03 [scrapy] INFO: Closing spider (finished)
2016-10-18 11:17:03 [scrapy] INFO: Stored json feed (20 items) in: test.json
2016-10-18 11:17:03 [scrapy] INFO: Dumping Scrapy stats:
Answer 0 (score: 1)
The error is in this part of your code:
if next_page is not None:
    next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)
As Padraic Cunningham mentioned in a comment, you yield the Request regardless of whether next_page is None or holds a URL. On the last page of results the CSS query finds no link, next_page is None, and Request raises the TypeError.
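You can solve the problem by changing your code to this:
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
Here the yield is inside the if block, so a Request is only created when a next-page link was found.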
By the way, you can change the if to the following:
if next_page:
because of Python's truthiness: None evaluates as false in a condition.
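The difference between the two checks can be sketched in plain Python. The helper names should_follow_strict and should_follow_truthy are made up for this illustration and are not part of Scrapy; the only point is that an empty string passes the "is not None" test but fails the truthiness test.

```python
def should_follow_strict(next_page):
    # Mirrors `if next_page is not None:`; only None is rejected.
    return next_page is not None


def should_follow_truthy(next_page):
    # Mirrors `if next_page:`; None and the empty string are both rejected.
    return bool(next_page)


# Compare both checks on a missing link, an empty href, and a real href.
for candidate in (None, "", "/search?q=france&start=20"):
    print(repr(candidate), should_follow_strict(candidate), should_follow_truthy(candidate))
```

Since extract_first() returns None when nothing matches, both checks catch the case from the question; the truthy form additionally skips an empty href.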
Because your spider stops working, try debugging your application through scrapy shell, where you can check whether your CSS query returns a value. You can also add an else block to the previous if that logs or prints a statement to the console saying that no next_page was found, so you know that something is wrong with the site or with the CSS query.
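A minimal sketch of the else idea, written as plain Python 3 so it runs without Scrapy. The function name next_page_url and the logger name "dmoz" are assumptions made for this example. Inside the spider you would call response.urljoin(next_page) instead of urljoin, and self.logger.warning(...) instead of the module-level logger.

```python
import logging
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in the urlparse module

logger = logging.getLogger("dmoz")  # hypothetical logger name for this sketch


def next_page_url(base_url, href):
    """Return the absolute next-page URL, or None if no link was extracted."""
    if href:
        # Same idea as response.urljoin(href) inside a spider callback.
        return urljoin(base_url, href)
    else:
        # No link found: either this is the last page or the selector is wrong.
        logger.warning("No next_page found on %s", base_url)
        return None


# A relative href is resolved against the page URL; a missing href is logged.
print(next_page_url("http://www.dmoz.org/search?q=france", "/search?q=france&start=20"))
print(next_page_url("http://www.dmoz.org/search?q=france", None))
```

In the spider, the caller would then yield a Request only when the returned value is not None.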