I'm new to Python, and recently I've been trying to use Scrapy to crawl a website with multiple pages. Below is a snippet from my spider.py:

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    for tuple in tuples:
        item = DataTuple()
        keyTemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keyTemp).rstrip()
        valueTemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valueTemp).rstrip()
        item[key] = value
        items.append(item)
    return items
Running the code with the following command:
scrapy crawl dumbSpider -o items.json -t json
gives:
{"a":"a-Value"},
{"b":"b-Value"},
{"c":"c-Value"},
{"a":"another-a-Value"},
{"b":"another-b-Value"},
{"c":"another-c-Value"}
But what I really want is something like this:
{"a":"a-Value", "b":"b-Value", "c":"c-Value"},
{"a":"another-a-Value", "b":"another-b-Value", "c":"another-c-Value"}
I've tried several ways to tweak spider.py, for example using a temporary list to store all the items of a single page and then appending that temporary list to items, but somehow it doesn't work.
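For reference, the kind of tweak I attempted looks roughly like this (just a sketch, not my exact code; DataTuple is my item class and I assumed it can be filled like a dict):

def parse(self, response):
    sel = Selector(response)
    rows = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    tempItem = DataTuple()          # temporary container for one page
    for row in rows:
        key = html2text.html2text(row.xpath('td[1]').extract()[0]).rstrip()
        value = html2text.html2text(row.xpath('td[2]').extract()[0]).rstrip()
        tempItem[key] = value
    items.append(tempItem)          # one merged item per page
    return items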
Update: indentation fixed.
Answer 0 (score: 0)
Below I've put together a quick mock-up of how I'd recommend doing this, provided you know the number of TDs per page. You can take some or all of it as needed. It's probably over-engineered for your problem (sorry!); you could just take the chunk_by_number bit and be done.
A few things to note:

1) Avoid using 'tuple' as a variable name, since it shadows the built-in type

2) Learn to use generators/built-ins, because they are faster and lighter when you process a lot of sites at once (see parse_to_kv and chunk_by_number below)

3) Try to isolate the parsing logic, so that if it changes you can easily swap it out in one place (see extract_td below)

4) Your function doesn't use 'self', so you should add the @staticmethod decorator and remove that parameter from the function (see the short sketch right after this list)

5) Currently the output is a dict, but if you need a JSON object you can import json and dump it (also sketched below)
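To illustrate points 4 and 5, a minimal self-contained sketch (the class and helper names here are hypothetical):

import json

import html2text
import scrapy

class DumbSpider(scrapy.Spider):
    name = "dumbSpider"

    # Point 4: this helper never touches 'self', so make it a staticmethod.
    @staticmethod
    def clean(raw_html):
        return html2text.html2text(raw_html).rstrip()

# Point 5: any dict can be serialized to a JSON string via the json module.
print(json.dumps({"a": "a-Value", "b": "b-Value", "c": "c-Value"}))

The mock-up itself: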
import html2text
from scrapy.selector import Selector

def extract_td(item, index):
    # Extraction logic for my websites: pulls either a key or a value
    # out of a table cell and returns a string representation of item[index].
    # This is very page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()
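# For example, on the first row of the sample table above (hypothetical values):
#   extract_td(row, 1) -> "a"    and    extract_td(row, 2) -> "a-Value"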
def parse_to_kv(xpaths):
    # Yields key, value pairs from the given selectors.
    # This is also page specific. Note that XPath positions are 1-based,
    # so the key lives in td[1] and the value in td[2].
    for xpath in xpaths:
        yield extract_td(xpath, 1), extract_td(xpath, 2)
def chunk_by_number(alist, num):
    # Splits alist into chunks of size num.
    # This is a very generic, reusable operation.
    for chunk in zip(*(iter(alist),) * num):
        yield chunk
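# For example, with a plain list (hypothetical input):
#   list(chunk_by_number([1, 2, 3, 4, 5, 6], 2)) -> [(1, 2), (3, 4), (5, 6)]
# Any trailing items that do not fill a complete chunk are silently dropped.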
def parse(response, td_per_page):
    # Extracts key/value pairs from the table data in response, groups them
    # into chunks of td_per_page, and prints each chunk as a dict.
    # This is very specific to our parse patterns.
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)
    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
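A hypothetical way to drive it, assuming each record spans three key/value rows (matching the sample output in the question):

# 'response' is the Scrapy Response object handed to your spider callback.
parse(response, td_per_page=3)
# prints:
#   {'a': 'a-Value', 'b': 'b-Value', 'c': 'c-Value'}
#   {'a': 'another-a-Value', 'b': 'another-b-Value', 'c': 'another-c-Value'}

In an actual spider you would more likely yield dict(page) instead of printing, and json.dumps it if you need JSON strings (point 5).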