I've started a Scrapy project and created this spider:
import scrapy


class CarSpider(scrapy.Spider):
    name = 'Car_Scrape'
    page_number = 2
    start_urls = [
        'https://www.finn.no/car/used/search.html?orgId=9117269&page=1'
    ]

    def parse(self, response):
        for quote in response.css('article.ads__unit'):
            yield {
                'title': quote.css('a.ads__unit__link::text').get(),
                'img:url': quote.css('img.img-format__img::attr(src)').get(),
                'link': quote.css('a.ads__unit__link::attr(href)').get(),
                'model_year': int(quote.css('div.ads__unit__content__keys div:nth-child(1)::text').get()),
                'mileage': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(2)::text').get())))),
                'price': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
            }
The problem is that when I run the crawl command:
scrapy crawl Car_Scrape -o data.json
it only scrapes the first 23 cars. But when I run this in the scrapy shell for the same URL:
for quote in response.css('article.ads__unit'):
    print(quote.css('a.ads__unit__link::text').get())
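the whole page gets scraped. I want the same result from the CarSpider class. Am I doing something wrong? Could someone check whether they hit the same problem, or whether it is my project that is causing trouble? Thanks a lot.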
Answer 0 (score: 1)
If I run your spider, I get 26 items, but then it raises an error:
2020-10-04 19:52:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.finn.no/car/used/search.html?orgId=9117269&page=1> (referer: None)
Traceback (most recent call last):
File "c:\program files\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "D:\Users\Ivan\Documents\Python\a.py", line 22, in parse
'price': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
ValueError: invalid literal for int() with base 10: ''
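On the page, the offending listing shows Solgt (Norwegian for "sold") where you expect a price, and your code does not handle that case. After the digits are filtered out, an empty string is passed to int(), which raises the ValueError. Because the exception is raised inside the parse generator, the remaining listings on the page are never yielded.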
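One possible way to handle this is to move the digit extraction into a small helper that returns None when the text contains no digits, instead of calling int() on an empty string. This is only a sketch: the name parse_int is hypothetical (it is not part of Scrapy), and returning None for sold cars is an assumption about how you want those listings represented.

```python
import re
from typing import Optional


def parse_int(text: Optional[str]) -> Optional[int]:
    """Return the ASCII digits found in text as an int.

    Returns None when text is None or contains no digits
    (for example a listing marked "Solgt" instead of a price).
    parse_int is a hypothetical helper, not a Scrapy function.
    """
    # Drop everything that is not 0-9, including spaces and "kr".
    digits = re.sub(r"[^0-9]", "", text or "")
    return int(digits) if digits else None


# Possible use inside CarSpider.parse, in place of the int(''.join(...)) calls:
#     'price': parse_int(
#         quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get()
#     ),

if __name__ == "__main__":
    print(parse_int("189 000 kr"))
    print(parse_int("Solgt"))
```

With a helper like this, a sold listing would be yielded with a price of None and the loop would continue to the next listing. If you would rather leave sold cars out, you could check for None and skip the item before yielding.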