How do I get structured JSON output with Scrapy?

Date: 2016-01-24 19:52:31

Tags: python json scrapy scrapy-spider

I'm new to Python, and recently I've been trying to use Scrapy to crawl a website that spans multiple pages. Below is a snippet from my spider.py:
    def parse(self, response):
        sel = Selector(response)
        tuples = sel.xpath('//*[td[@class = "caption"]]')
        items = []

        for tuple in tuples:
            item = DataTuple()

            keyTemp = tuple.xpath('td[1]').extract()[0]
            key = html2text.html2text(keyTemp).rstrip()
            valueTemp = tuple.xpath('td[2]').extract()[0]
            value = html2text.html2text(valueTemp).rstrip()

            item[key] = value
            items.append(item)
        return items

Running the code with the following command:

scrapy crawl dumbSpider -o items.json -t json

it gives:

{"a":"a-Value"},
{"b":"b-Value"},
{"c":"c-Value"},
{"a":"another-a-Value"},
{"b":"another-b-Value"},
{"c":"another-c-Value"}

But what I really want is something like this:

{"a":"a-Value", "b":"b-Value", "c":"c-Value"},
{"a":"another-a-Value", "b":"another-b-Value", "c":"another-c-Value"}

I've tried several ways of tweaking spider.py, such as using a temporary list to collect all the "item"s from a single page and then appending that temporary list to "items", but somehow it doesn't work.
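For reference, a minimal sketch of that idea, assuming every matched row on one response belongs to the same record (a plain dict stands in for DataTuple here, purely for illustration):

    def parse(self, response):
        # sketch: merge all key/value rows of one page into a single dict
        sel = Selector(response)
        rows = sel.xpath('//*[td[@class = "caption"]]')

        merged = {}
        for row in rows:
            key = html2text.html2text(row.xpath('td[1]').extract()[0]).rstrip()
            value = html2text.html2text(row.xpath('td[2]').extract()[0]).rstrip()
            merged[key] = value

        return [merged]  # one combined item per page instead of one per row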

Update: indentation fixed.

1 Answer:

Answer 0 (score: 0)

Below I've made a quick mock-up of how I would recommend doing this, as long as you know the number of TDs per page. You can take some or all of it as needed. It may be over-engineered for your problem (sorry!); you could just take the chunk_by_number bit and be done with it...

A few things to note:

1) Avoid using 'tuple' as a variable name, since it shadows the built-in type

2) Learn to use generators/builtins, as they are faster and lighter when you're crawling a lot of sites at once (see parse_to_kv and chunk_by_number below)

3) Try to isolate the parsing logic so that if it changes you can easily swap it out in one place (see extract_td below)

4) Your function doesn't use 'self', so you should apply the @staticmethod decorator and remove that parameter from the function

5) Currently the output is a dict, but if you need a JSON object you can import json and dump it, as sketched below
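For point 5, a quick illustration (the page variable here is a hypothetical stand-in for one chunk produced by chunk_by_number in the code below):

import json

# hypothetical chunk of key/value pairs, as produced by chunk_by_number
page = [("a", "a-Value"), ("b", "b-Value"), ("c", "c-Value")]
print(json.dumps(dict(page)))  # -> {"a": "a-Value", "b": "b-Value", "c": "c-Value"} (key order may vary)

And the mock-up itself: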

import html2text
from scrapy.selector import Selector


def extract_td(item, index):
    # extraction logic for my websites, which allows extracting
    # either a key or a value from a table data cell;
    # returns a string representation of item[index]
    # this is very page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()

def parse_to_kv(xpaths):
    # yields key, value pairs from the given selectors
    # (note: XPath positions are 1-based, hence td[1]/td[2])
    # this is also page specific
    for xpath in xpaths:
        yield extract_td(xpath, 1), extract_td(xpath, 2)

def chunk_by_number(alist, num):
    # splits alist into chunks of num size.
    # This is a very generic, reusable operation
    for chunk in zip(*(iter(alist),) * num):
        yield chunk

def parse(response, td_per_page):
    # extracts key/value pairs from the table data cells in response
    # and prints one dict of td_per_page key/value pairs per page
    # this is very specific to our parse patterns
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)

    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
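
A possible way to wire this into your spider (a sketch with a hypothetical class name and TD_PER_PAGE value; Scrapy calls parse itself, so the chunk size comes from a class attribute rather than a parameter, and each combined dict is yielded so that -o items.json writes the merged records):

import scrapy

class DumbSpider(scrapy.Spider):
    # hypothetical wiring for the helpers above
    name = "dumbSpider"
    TD_PER_PAGE = 3  # assumed: three key/value rows make up one record

    def parse(self, response):
        sel = Selector(response)
        tuples = sel.xpath('//*[td[@class = "caption"]]')
        for page in chunk_by_number(parse_to_kv(tuples), self.TD_PER_PAGE):
            yield dict(page)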