Question

I am trying to parse some data from an XML file whose description field contains HTML.

For example, the data looks like this:

<xml>
    <description>
        <body>
           HTML I want
        </body>
    </description>
    <description>
        <body>
           - more data I want -
        </body>
    </description>
</xml>

So far, this is what I have come up with:

from bs4 import BeautifulSoup

soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
    bodies = i.find_all('body')
    # This will return an object of type 'ResultSet'
    for n in bodies:
        print(n)
        # Nothing prints here.

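I'm not sure where I'm going wrong; when I enumerate the entries in descContent, it shows the content I'm looking for. The tricky part is getting at the nested <body> entries. Thanks in advance!

Edit: After playing around some more, it seems BeautifulSoup does not recognize that there is HTML inside the <description> tag. It treats it as plain text, hence the problem. I am thinking of saving the results as an HTML file and re-parsing that, but I'm not sure it would work, since what gets saved is a literal string including all the carriage returns and newlines...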
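The re-parsing idea from the edit does not need an intermediate file, because BeautifulSoup accepts a string directly. Below is a minimal sketch of that two-pass approach, assuming the HTML inside <description> is entity-escaped (which would explain why it shows up as text). The SAMPLE data and the extract_bodies helper are hypothetical and only illustrate the idea; they are not taken from the original file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the HTML inside <description> is entity-escaped,
# so the first parse sees it as plain text rather than as tags.
SAMPLE = """<xml>
    <description>&lt;body&gt;HTML I want&lt;/body&gt;</description>
    <description>&lt;body&gt;- more data I want -&lt;/body&gt;</description>
</xml>"""


def extract_bodies(markup):
    """Return the text of every <body> found inside each <description>."""
    outer = BeautifulSoup(markup, "html.parser")
    results = []
    for description in outer.find_all("description"):
        # get_text() yields the unescaped string, e.g. "<body>...</body>",
        # which is then parsed a second time as HTML.
        inner = BeautifulSoup(description.get_text(), "html.parser")
        for body in inner.find_all("body"):
            results.append(body.get_text(strip=True))
    return results


if __name__ == "__main__":
    for text in extract_bodies(SAMPLE):
        print(text)
```

If the nested HTML is stored as real tags rather than escaped text, the second parse is unnecessary and a single find_all('body') on each description should be enough.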

Answer 1

Use the XML parser from lxml. You can install the lxml parser with pip install lxml, then parse the file like this:

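from bs4 import BeautifulSoup

# The 'xml' parser requires lxml to be installed.
with open("file.html") as fp:
    soup = BeautifulSoup(fp, 'xml')

for description in soup.find_all('description'):
    for body in description.find_all('body'):
        # Drop hyphens and newlines, then trim leading spaces.
        print(body.text.replace('-', '').replace('\n', '').lstrip(' '))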

Or you can simply write:

print(body.text)
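One thing to keep in mind about the chained replace calls: they remove every hyphen and newline, including ones inside the text itself. If only the surrounding dashes and whitespace should go, a gentler cleanup might look like the sketch below. The clean_body_text helper is a hypothetical name introduced here for illustration, not part of BeautifulSoup.

```python
def clean_body_text(text):
    """Trim whitespace and hyphens from both ends, leaving the middle intact."""
    # str.strip() with an argument removes any of these characters
    # from the start and end of the string only.
    return text.strip(" \t\r\n-")


if __name__ == "__main__":
    raw = "\n           - more data I want -\n        "
    print(clean_body_text(raw))
```

It could be called as print(clean_body_text(body.text)) inside the loop shown in the answer.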

Parsing HTML nested within an XML file (using BeautifulSoup)
