Question

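I'm new to Python. I managed to put together a script that scrapes a web page (sample response below) and then dumps the data to a file as JSON.

The response contains several item elements, and I want one object per item. That part works: every JSON object in the file has a guid and a title. But each item also contains several category elements, and I can't figure out how to add them to the output. I can loop over the category elements and print them, but I can't append them to the output.

The response I get has the following structure: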

<channel>
    <title>XXX</title>
    ...
    <item>
        <title>XX</title>
        <description>XX</description>
        <category>AAA</category>
        ...
        <category>DDD</category>
        <guid>XX</guid>
    </item>
        ...
    <item>
        …
    </item>
    …
</channel>

Here is the code:

import urllib
import json
from bs4 import BeautifulSoup

webPage = urllib.urlopen('XXX')
soup = BeautifulSoup(webPage.read())

items = soup.find_all('item')
output = []

for item in items:
    for c in item.findAll('category'):
        print c  # each category prints out, but how do I add it to output?
    output.append({
        "guid":  (item.find("guid").contents[0]).encode('utf-8'),
        "title": (item.find("title").contents[0]).encode('utf-8'),
        #"category":  item.findAll('category')
    })

# the with block closes the file on exit, so no explicit close() is needed
with open("jsonOutput.json", 'w') as jsonFile:
    json.dump(output, jsonFile, sort_keys=True, indent=4, ensure_ascii=False)

Thanks a lot for taking a look!

Answer 1

My BeautifulSoup knowledge may be a bit rusty.

You want to attach the category nodes to each item as a list of strings, for example:

"category": ["category_1","category_2",...,"category_n"]

This can be done as follows:

for item in items:
    # collect the text of every <category> in this item into a list
    categories = [c.contents[0].encode('utf-8') for c in item.findAll('category')]
    output.append({
        "guid":  (item.find("guid").contents[0]).encode('utf-8'),
        "title": (item.find("title").contents[0]).encode('utf-8'),
        "category": categories,
    })

This will output:

[
    {
        "category": [
            "AAA", 
            "DDD"
        ], 
        "guid": "XX", 
        "title": "XX"
    }
]
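
The code above is Python 2. On Python 3, `.encode('utf-8')` returns bytes, which `json.dump` refuses to serialize, so the strings should be left as they are. Below is a minimal Python 3 sketch of the same idea. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `SAMPLE_FEED` and `items_to_records` are hypothetical names made up for this illustration, and the sample feed only imitates the structure shown in the question.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample feed that imitates the structure from the question.
SAMPLE_FEED = """<channel>
    <title>XXX</title>
    <item>
        <title>XX</title>
        <description>XX</description>
        <category>AAA</category>
        <category>DDD</category>
        <guid>XX</guid>
    </item>
    <item>
        <title>YY</title>
        <guid>YY</guid>
    </item>
</channel>"""


def items_to_records(xml_text):
    """Return one dict per <item>, with all of its <category> texts in a list."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        records.append({
            # findtext returns the default when the child element is missing
            "guid": item.findtext("guid", default=""),
            "title": item.findtext("title", default=""),
            # findall matches direct <category> children; no matches gives []
            "category": [c.text or "" for c in item.findall("category")],
        })
    return records


records = items_to_records(SAMPLE_FEED)
print(json.dumps(records, sort_keys=True, indent=4, ensure_ascii=False))
```

To write to a file instead of printing, pass the same arguments to `json.dump(records, json_file, ...)` inside a `with open("jsonOutput.json", "w", encoding="utf-8") as json_file:` block. An item without any category elements simply gets an empty list.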
