Scrapy crawl command does not scrape the whole page

Date: 2020-10-04 11:28:29

Tags: python scrapy

I've started a scraping project and created this spider:

import scrapy

class CarSpider(scrapy.Spider):

    name = 'Car_Scrape'
    page_number = 2

    start_urls = [
        'https://www.finn.no/car/used/search.html?orgId=9117269&page=1'
    ]

    def parse(self, response):

        for quote in response.css('article.ads__unit'):

            yield {
                'title': quote.css('a.ads__unit__link::text').get(),
                'img:url': quote.css('img.img-format__img::attr(src)').get(),
                'link': quote.css('a.ads__unit__link::attr(href)').get(),
                'model_year': int(quote.css('div.ads__unit__content__keys div:nth-child(1)::text').get()),
                'mileage': int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(2)::text').get())))),
                'price':  int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
            }

The problem is that when I run the crawl command:

scrapy crawl Car_Scrape -o data.json

it only scrapes the first 23 cars. But when I run this in the scrapy shell against the same URL:

for quote in response.css('article.ads__unit'):
     print(quote.css('a.ads__unit__link::text').get())

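I get every listing on the page. I would like to get the same result from the CarSpider class. Am I doing something wrong? Could someone check whether they run into the same problem, or whether it is my project that is causing trouble? Many thanks.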

1 answer:

Answer 0: (score: 1)

When I run your spider I get 26 items, but then it raises an error:

2020-10-04 19:52:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.finn.no/car/used/search.html?orgId=9117269&page=1> (referer: None)
Traceback (most recent call last):
  File "c:\program files\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Ivan\Documents\Python\a.py", line 22, in parse
    'price':  int(''.join(list(filter(str.isdigit, quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get())))),
ValueError: invalid literal for int() with base 10: ''

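On the page, the listing that fails shows Solgt ("sold" in Norwegian) where you expect a price, and your code does not handle that case. Filtering "Solgt" for digits leaves an empty string, and int('') raises the ValueError shown above. Because the exception is raised inside parse, the generator stops there and none of the listings after the sold one are yielded, which is why the spider returns only part of the page while the shell loop, which only prints titles, gets through all of it.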
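One possible way to handle this is to move the digit extraction into a small helper that returns None when there is nothing to convert. The following is only a sketch: the name digits_to_int is hypothetical (it is not part of Scrapy), and using None for sold listings is an assumption about what you want in the output.

```python
from typing import Optional


def digits_to_int(text: Optional[str]) -> Optional[int]:
    """Return the decimal digits in `text` as an int, or None if there are none.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # .get() returns None when the selector matches nothing, so fall back to ''.
    # str.isdecimal is used instead of str.isdigit because isdigit also accepts
    # characters such as superscripts that int() cannot parse.
    digits = ''.join(filter(str.isdecimal, text or ''))
    # "Solgt" has no digits, so `digits` is '' and int('') would raise ValueError.
    return int(digits) if digits else None


# Inside parse(), the numeric fields could then be written as, for example:
#   'price': digits_to_int(
#       quote.css('div.ads__unit__content__keys div:nth-child(3)::text').get()),
```

With this change a sold listing would be yielded with a price of None instead of stopping the spider, and you could filter those items out later if you do not want them.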