Question

I have a malformed XML file; trying to read it raises an error:

import xml.etree.ElementTree as ET
ET.parse(r'my.xml')

I get the following error:

ParseError: not well-formed (invalid token): line 2034, column 317

So I read the XML with BeautifulSoup instead, using the following code:

from bs4 import BeautifulSoup

with open(r'my.xml') as fp:
    soup = BeautifulSoup(fp, 'xml')

If I print soup, it looks like this:

        <Placemark> 
<name>India </name> 
    <description>Country</description> 
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>
        <Placemark> 
<name>USA</name>   
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>            
    <Placemark>   
    <description>City</description> 
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>

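I have more than 100 Placemark tags in total, each with information inside. I want to capture the name and description of each one and build a df with the corresponding columns.

My code is: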

name_tag=[x.text.strip() for x in soup.findAll('name')]
description_tag =[x.text.strip() for x in soup.findAll('description')]

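The problem is that some Placemark tags have no description tag at all, and some have no name tag. Because of the missing tags the two lists fall out of step, so I cannot tell which name goes with which description.

Expected output dataframe: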

Name      Description
India     Country
USA
           City

Is there any way to achieve this?

Answer 1

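Because you search for the name and description tags separately, you lose track of which name belongs to which description.

Instead, parse each Placemark tag on its own and handle the case where a Placemark is missing its name or description tag.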

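import pandas as pd

data = []

# The 'xml' parser keeps tag names case-sensitive, so match 'Placemark' exactly.
for placemark in soup.find_all('Placemark'):
    # find() returns None when the child tag is absent, so .text raises AttributeError.
    try:
        name = placemark.find('name').text.strip()
    except AttributeError:
        name = None
    try:
        description = placemark.find('description').text.strip()
    except AttributeError:
        description = None

    # Keep both values in one row so they stay paired.
    data.append((name, description))

df = pd.DataFrame(data, columns=['Name', 'Description'])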
print(df)
#       Name    Description
#  0   India        Country
#  1     USA           None
#  2    None           City
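
For comparison, here is a minimal sketch of the same per-Placemark idea using only the standard library. It assumes a small, well-formed sample (the SAMPLE string and the child_text helper are made up for illustration); the real file in the question would still need BeautifulSoup or a repair step first, since ElementTree rejects it. findtext returns None when a child tag is missing, which removes the need for try/except.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample modelled on the Placemark tags in the question.
SAMPLE = """<Document>
  <Placemark><name>India </name><description>Country</description></Placemark>
  <Placemark><name>USA</name></Placemark>
  <Placemark><description>City</description></Placemark>
</Document>"""


def child_text(parent, tag):
    """Return the stripped text of the first child named tag, or None if absent."""
    text = parent.findtext(tag)  # None when the child tag does not exist
    return text.strip() if text is not None else None


root = ET.fromstring(SAMPLE)

# One row per Placemark, so each name stays paired with its own description.
rows = [
    (child_text(pm, "name"), child_text(pm, "description"))
    for pm in root.iter("Placemark")
]
print(rows)
```

The resulting rows list can be passed to pd.DataFrame(rows, columns=['Name', 'Description']) exactly as above.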
