I'm trying to scrape product information (title, price, variants, etc.) and export it to CSV.
Single-valued fields such as the title and price should occupy one cell, but fields with several variants should get one variant per row, like in a spreadsheet.
In my case the color name, color image URL and color variant URL columns each have many values, and I want each value on its own row instead of the whole list crammed into one cell.
How can I do this?
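For one product, the first few rows of the CSV should therefore look roughly like this (a hypothetical subset of the columns, values shortened from the sample output at the bottom):

title,price,color
Basic bomullscaps med 5 paneler,"16,22 kr",Hvit
,,Svart ensfarget
,,Marineblå
,,Kongeblå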
My code:
import scrapy
import string


class KapserSpider(scrapy.Spider):
    name = 'kapser2'
    allowed_domains = ['xpressprofil.no']
    start_urls = ['http://www.xpressprofil.no/sortiment/kapser-hodeplagg/']

    def parse(self, response):
        # Each listing block links to a product detail page.
        items = response.xpath('//div[contains(@class, "listing")]')
        for item in items:
            first_url = item.xpath('.//a/@href').extract_first()
            absolute_url = response.urljoin(first_url)
            price = item.xpath('.//span[@class="listing_price "]/text()').extract_first()
            yield scrapy.Request(absolute_url, callback=self.parse_item, meta={'price': price})

        # Follow pagination.
        next_page_url = response.xpath('//li[@rel="next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url, callback=self.parse)

    def parse_item(self, response):
        item = {}
        # Single-valued fields: one value per product.
        item['title'] = response.xpath('//h3[@class="margin-top-0"]/text()').extract_first()
        item['price'] = response.meta['price']
        item['category'] = response.css('.breadcrumb li:nth-child(2) a::text').extract_first()
        item['sub_category'] = response.css('.breadcrumb li:nth-last-child(2) a::text').extract_first()
        item['image'] = response.xpath('//a[@class="thumbnail"]/img/@src').extract_first()
        item['url'] = response.url
        item['description'] = response.xpath('//div[contains(@class, "product-details")]//p/text()').extract_first()

        # Quantity discount table collected into a dict (see 'sales' in the output below).
        sale_price = map(string.strip, response.xpath('//th[contains(@class, "text-right")]/text()').extract())
        sale_price_normalize = [x.encode('utf-8') for x in sale_price]
        min_order_amount = map(string.strip, response.xpath('//td[contains(@class, "text-right")]/text()').extract())
        min_order_amount_normalize = [y.encode('utf-8') for y in min_order_amount]
        item['sales'] = dict(zip(sale_price_normalize, min_order_amount_normalize))

        # Multi-valued field: one entry per color variant.
        item['color'] = response.xpath('//div[contains(@class, "product-details")]//a[@href[contains(., "htm")]]/img/@alt').extract()

        item['gtin_barcode'] = response.xpath('//div[@id="spec"]//td[contains(., "EAN")]/following-sibling::td[1]/text()').extract()
        item['brand'] = ""
        item['mpn'] = response.xpath('//div[@id="spec"]//td[contains(., "Artikkelnummer")]/following-sibling::td[1]/text()').extract()
        item['delivery_time'] = response.xpath('//div[@id="spec"]//td[contains(., "Leveransetid") or contains(., "Leveringstid")]/following-sibling::td[1]/text()').extract()
        #variations = response.xpath('//select[@data-name="Merking"]//option/text()').extract()
        yield item
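For reference, a yielded item ends up looking roughly like this (values taken from the sample output at the bottom, most single-valued fields omitted), so 'color' is a plain list and 'sales' is a dict:

item = {
    'title': u'Basic bomullscaps med 5 paneler',
    'price': u'16,22 kr',
    'sales': {'1': '17,91', '100': '17,05', '250': '16,55', '1000': '16,22'},
    'color': [u'Hvit', u'Svart ensfarget', u'Marineblå'],  # ...more colors
    # ...the remaining single-valued fields
}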
Update: yes, "How can i export scraped data to csv file in the right format?" has already been answered and my problem is similar, but the solution doesn't work for me. I still get the data in one row ([red, green, black]) instead of one row per variant. As far as I understand, the problem is in how the values are iterated over in exporters.py, but I can't figure out where:
from itertools import izip_longest
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.conf import settings


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))
        # Keep lists/tuples (unwrapping a single nested list), turn everything
        # else into a 1-tuple so izip_longest can transpose them into rows.
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]
        # Emit one CSV row per position; shorter columns are padded with ''.
        multi_row = izip_longest(*values, fillvalue='')
        for row in multi_row:
            self.csv_writer.writerow(row)
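For completeness, the exporter only replaces the stock CSV exporter if it is registered for the csv format via FEED_EXPORTERS in settings.py. A minimal sketch, where the module path 'myproject.exporters' is a placeholder for wherever the class actually lives:

# settings.py -- module path is hypothetical
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.NewLineRowCsvItemExporter',
}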
Output data format in the CSV (everything ends up in a single row):

Basic bomullscaps med 5 paneler,Kapser & Hodeplagg,,Kapser,http://www.xpressprofil.no/sortiment/kapser-hodeplagg/kapser/89545340-225584/basic-bomullscaps-med-5-paneler.htm,,"16,22 kr",,"{'1': '17,91', '100': '17,05', '250': '16,55', '1000': '16,22'}","Hvit,Svart ensfarget,Marineblå,Kongeblå,Rød,Kaki,Grå,Eplegrønn,Oransje,Gul",11106604,//prodimg.unpr.io/itemimage/800/products/500/159637103c53ba.jpg,Basic bomullscaps med 5 paneler. Bomull. 155-160 g/m2.
I have lists and a dict in the item here, maybe that is the problem?
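To check my understanding of the transposition step, here is a standalone sketch (hypothetical values, same idea as the list/tuple wrapping in export_item()): scalars and the dict become 1-tuples and only fill the first row, while a list field spreads over several rows:

from itertools import izip_longest  # itertools.zip_longest on Python 3

# Values as the exporter would see them for one item (shortened).
values = [
    u'Basic bomullscaps med 5 paneler',           # title: scalar
    {'1': '17,91', '100': '17,05'},               # sales: dict
    [u'Hvit', u'Svart ensfarget', u'Marineblå'],  # color: list
]

# Only lists/tuples keep their length; everything else becomes a 1-tuple.
wrapped = [v if isinstance(v, (list, tuple)) else (v,) for v in values]

for row in izip_longest(*wrapped, fillvalue=''):
    print(row)
# (u'Basic bomullscaps med 5 paneler', {'1': '17,91', '100': '17,05'}, u'Hvit')
# ('', '', u'Svart ensfarget')
# ('', '', u'Marineblå')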