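The website I am working with is https://www.grohe.com/in. On that site I want to get one type of bathroom faucet, https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. That page lists multiple products/related products. I want to get the URL of each product and scrape its data. This is what I wrote so far.
My items.py file looks like this: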
from scrapy.item import Item, Field

class ScrapytestprojectItem(Item):
    producturl = Field()
    imageurl = Field()
    description = Field()
The spider code is:
import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem

class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]
    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})
When I run Scrapy with **scrapy crawl nestedurl -o nestedurl.csv**, an empty file is created. The console output is:
2017-02-15 18:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-15 18:03:13 [scrapy] DEBUG: Crawled (200) <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
2017-02-15 18:03:13 [scrapy] ERROR: Spider error processing <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pradeep/ScrapyTestProject/ScrapyTestProject/spiders/nestedurl.py", line 15, in parse
    next_page = response.urljoin(item['producturl'])
  File "/usr/lib/python2.7/dist-packages/scrapy/http/response/text.py", line 72, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python2.7/urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
TypeError: unhashable type: 'list'
2017-02-15 18:03:13 [scrapy] INFO: Closing spider (finished)
2017-02-15 18:03:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 253,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 31063,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 15, 12, 33, 13, 396542),
'log_count/DEBUG': 3,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2017, 2, 15, 12, 33, 11, 568424)}
2017-02-15 18:03:13 [scrapy] INFO: Spider closed (finished)
Answer 0 (score: 0)
I think the problem is that divs.css('a::attr(href)').extract() returns a list. When that list is passed to urljoin, urlparse fails because it cannot hash a list.
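To see the failure outside of Scrapy, here is a minimal sketch that uses only the standard library. It assumes Python 3, where urljoin lives in urllib.parse (the traceback above comes from Python 2.7's urlparse module, so the exact error message may differ). The hrefs list is a made-up stand-in for what .extract() returns.

```python
from urllib.parse import urljoin

base = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"
# Stand-in for the result of .extract(): always a list of strings.
hrefs = ["/in/25796/bathroom/bathroom-faucets/grandera/"]

# Passing the whole list to urljoin raises TypeError.
try:
    urljoin(base, hrefs)
    error = None
except TypeError as exc:
    error = exc

# Passing a single string from the list works as expected.
joined = urljoin(base, hrefs[0])

print(type(error).__name__)
print(joined)
```

So the value handed to urljoin has to be a single string, not the list itself.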
Answer 1 (score: 0)
The URL is not being generated correctly.
You should enable logging and log some messages to debug your code.
import logging

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem

class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            logging.info(item['producturl'])  # log the raw value first: urljoin raises before the next log line runs
            next_page = response.urljoin(item['producturl'])
            logging.info(next_page)  # see what it prints in the console
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})
Answer 2 (score: 0)
item = {'producturl': divs.css('a::attr(href)').extract(),  # <--- issue here
        'imageurl': divs.css('a img::attr(src)').extract(),
        'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])  # <--- here item['producturl'] is a list
To fix this, use .extract_first(''), which returns a single string (or the given default '' when nothing matches):
item = {'producturl': divs.css('a::attr(href)').extract_first(''),
        'imageurl': divs.css('a img::attr(src)').extract_first(''),
        'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
next_page = response.urljoin(item['producturl'])