How can I use Scrapy's CSVFeedSpider to scrape a feed whose values contain commas?

Time: 2014-05-07 15:31:03

Tags: csv scrapy web-crawler

I am trying to use Scrapy's CSVFeedSpider on a CSV link. Here is an example row:

number,"may contain commas","may contain commas","may contain commas",text,text,text,text,text and "may contain commas"

If a value contains a comma, it is enclosed in quotes. How can I handle this, given that the spider accepts only a single delimiter?

http://doc.scrapy.org/en/latest/topics/spiders.html#csvfeedspider

1 Answer:

Answer 0: (score: 0)

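If a column is enclosed in double quotes, commas inside it work fine, because the double quote is the default quote character for CSV parsing. If it is enclosed in single quotes, the quotes are treated as ordinary characters, the value is split at every comma, and the spider complains about a length mismatch between the row and the headers.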
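The difference can be reproduced without Scrapy. The following is a minimal sketch using Python's standard `csv` module, which is assumed here to behave like the parser the spider relies on; the shortened sample row and the `parse` helper are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical, shortened single-quoted row (not taken from the test file).
line = "3,'John3, Doe',John.Doe@test.com\n"


def parse(text, quotechar):
    """Parse CSV text with the given quote character and return all rows."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar)
    return list(reader)


# Default quote character: single quotes are ordinary characters,
# so the name is split at its comma.
default_rows = parse(line, '"')

# Single quote as quote character: the name stays in one field.
single_rows = parse(line, "'")

print(default_rows)
print(single_rows)
```

Scrapy's documentation lists a `quotechar` attribute on CSVFeedSpider that defaults to the double quote. Assuming your Scrapy version supports it, setting `quotechar = "'"` on the spider should have the same effect for a feed that is quoted with single quotes throughout.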

Here is the spider code:

# -*- coding: utf-8 -*-
# The 2014 answer imported from scrapy.contrib.spiders and scrapy.log;
# both have been deprecated since Scrapy 1.0, so the current paths are used.
from scrapy.spiders import CSVFeedSpider

from stackoverflow23429315.items import DemoItem


class DmozSpider(CSVFeedSpider):
    name = 'csvFeedTest'
    start_urls = ['file:////home/vagrant/labs/stackoverflow23429315/test.csv']
    delimiter = ','  # field separator; double-quoted values may contain it
    headers = ['id', 'name', 'address1', 'address2', 'email']

    def parse_row(self, response, row):
        # row is a dict keyed by the names listed in headers
        self.logger.info('Hi, this is a row!: %r', row)

        item = DemoItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['address1'] = row['address1']
        item['address2'] = row['address2']
        item['email'] = row['email']
        return item

The item class:

from scrapy.item import Item, Field

class DemoItem(Item):
    id = Field()
    name = Field()
    address1 = Field()
    address2 = Field()
    email = Field()

The test CSV file:

1,"John, Doe","1234 Main Street, APT A","2nd Floor",John.Doe@test.com
2,"John2, Doe","1234 Main Street, APT A","2nd Floor",John.Doe@test.com
3,'John3, Doe','1234 Main Street, APT A','2nd Floor',John.Doe@test.com
4,'John4, Doe','1234 Main Street, APT A','2nd Floor',John.Doe@test.com
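To see which rows of this file trigger the length mismatch, the field count of each row can be compared with the five headers. This is a rough sketch with Python's standard `csv` module rather than Scrapy itself; the `field_counts` helper is hypothetical.

```python
import csv

HEADERS = ['id', 'name', 'address1', 'address2', 'email']

# The four rows of the test file, one string per line.
TEST_ROWS = [
    '1,"John, Doe","1234 Main Street, APT A","2nd Floor",John.Doe@test.com',
    '2,"John2, Doe","1234 Main Street, APT A","2nd Floor",John.Doe@test.com',
    "3,'John3, Doe','1234 Main Street, APT A','2nd Floor',John.Doe@test.com",
    "4,'John4, Doe','1234 Main Street, APT A','2nd Floor',John.Doe@test.com",
]


def field_counts(lines, quotechar):
    """Return the number of fields parsed from each line."""
    reader = csv.reader(lines, delimiter=',', quotechar=quotechar)
    return [len(row) for row in reader]


double_counts = field_counts(TEST_ROWS, '"')  # default quote character
single_counts = field_counts(TEST_ROWS, "'")  # single quote instead

print(double_counts)  # → [5, 5, 7, 7]
print(single_counts)  # → [7, 7, 5, 5]
print(len(HEADERS))   # → 5
```

Under these assumptions, no single quote character yields five fields for all four rows, so a feed that mixes both quoting styles would have to be normalized before it is parsed.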