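Problem converting the following XML elements into a pandas DataFrame?

Asked: 2018-08-20 00:47:39

Tags: python python-3.x pandas beautifulsoup lxml

I am using Beautiful Soup to parse a bunch of XML files and extract some information from them, as follows: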

import os
from glob import glob

from bs4 import BeautifulSoup

a_lis = []
for filepath in glob(os.path.join('../data/trainingFiles/', '*.xml')):
    with open(filepath) as f:
        content = f.read()
        results = BeautifulSoup(content, 'lxml')
        #print(results)
        for LabelInteractions in results.find_all("labelinteractions"):
            #print(LabelInteractions)
            for labelinteractions in LabelInteractions.findAll('labelinteraction'):
                print(labelinteractions)

Output:

<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
....
<labelinteraction precipitant="riociguat" precipitantcode="N0000188995" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>

How can I convert these XML attributes into a pandas DataFrame? The columns should look like this:

precipitant  precipitantcode type effect

2 Answers:

Answer 0 (score: 2)

You can store each column in a list, then create the DataFrame:

from collections import defaultdict

from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup("""
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
<LabelInteraction type="Pharmacodynamic interaction" precipitant="alcohol" precipitantCode="N0000007432" effect=" 25064002: Headache (finding)"/>
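""", 'lxml')
# The lxml HTML parser lowercases tag and attribute names,
# so 'precipitantCode' above is read as 'precipitantcode'.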

columns = ['precipitant', 'precipitantcode', 'type', 'effect']
d = defaultdict(list)

for labelinteraction in soup.find_all('labelinteraction'):
    for col in columns:
        # .get() returns None when the tag does not have this attribute
        d[col].append(labelinteraction.get(col))

df = pd.DataFrame(d)

Output:

     precipitant precipitantcode                         type                         effect
0      ritonavir     N0000007423      Unspecified interaction                           None
1  gc stimulator          NO MAP      Unspecified interaction                           None
2        alcohol     N0000007432  Pharmacodynamic interaction   25064002: Headache (finding)
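
A variation on the same idea, offered as a minimal sketch rather than the only way to do it: each tag's `attrs` is already a dict, so a list of those dicts can be passed straight to `pd.DataFrame`. The `columns` argument fixes the column order and leaves NaN where a tag lacks an attribute. The sample string is cut down from the question, and `html.parser` is used here only as an assumption so the snippet runs without lxml installed; like the lxml HTML parser, it lowercases tag and attribute names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Two sample elements taken from the question's output
xml = """
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction" effect=" 25064002: Headache (finding)"></labelinteraction>
"""

# Assumption: html.parser instead of lxml, to avoid the extra dependency
soup = BeautifulSoup(xml, "html.parser")

columns = ["precipitant", "precipitantcode", "type", "effect"]

# One dict of attributes per tag; keys missing from a dict become NaN
rows = [dict(tag.attrs) for tag in soup.find_all("labelinteraction")]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Note that missing values show up as NaN here rather than None, which is usually what you want for later `isna()` / `fillna()` calls.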

Answer 1 (score: 1)

If you have a list of the columns you want:

cols = ['precipitant', 'precipitantcode', 'type']

Then you can loop over them and append each value to a list in a dictionary:

d = {}
for labelinteractions in LabelInteractions.find_all('labelinteraction'):
    for c in cols:
        # labelinteractions[c] raises KeyError if the attribute is absent
        d.setdefault(c, []).append(labelinteractions[c])

Once that is done, you can build the DataFrame:

df = pd.DataFrame(d)

This is what I get for your sample:

     precipitant precipitantcode                         type
0      ritonavir     N0000007423      Unspecified interaction
1  gc stimulator          NO MAP      Unspecified interaction
2      riociguat     N0000188995      Unspecified interaction
3        alcohol     N0000007432  Pharmacodynamic interaction
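
To tie this back to the loop over files in the question, here is a rough sketch that collects rows across every `*.xml` file in a folder and tolerates attributes such as `effect` that only some elements carry. The helper name `collect_interactions`, the two sample files, and the use of `html.parser` (so the snippet runs without lxml) are all assumptions made for illustration; the real files in `../data/trainingFiles/` may be laid out differently.

```python
import os
import tempfile
from glob import glob

from bs4 import BeautifulSoup
import pandas as pd


def collect_interactions(folder, cols):
    """Hypothetical helper: one row per <labelinteraction> across all *.xml files."""
    d = {c: [] for c in cols}
    for filepath in sorted(glob(os.path.join(folder, "*.xml"))):
        with open(filepath, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        for tag in soup.find_all("labelinteraction"):
            for c in cols:
                # .get() yields None for a missing attribute instead of raising KeyError
                d[c].append(tag.get(c))
    return pd.DataFrame(d)


# Demo on two made-up files written to a temporary directory
with tempfile.TemporaryDirectory() as folder:
    samples = {
        "a.xml": '<LabelInteractions><LabelInteraction precipitant="ritonavir" '
                 'precipitantCode="N0000007423" type="Unspecified interaction">'
                 '</LabelInteraction></LabelInteractions>',
        "b.xml": '<LabelInteractions><LabelInteraction precipitant="alcohol" '
                 'precipitantCode="N0000007432" type="Pharmacodynamic interaction" '
                 'effect=" 25064002: Headache (finding)">'
                 '</LabelInteraction></LabelInteractions>',
    }
    for name, text in samples.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(text)

    # Column names are lowercase because the parser lowercases attribute names
    df = collect_interactions(folder, ["precipitant", "precipitantcode", "type", "effect"])

print(df)
```

Pre-creating one empty list per column keeps all lists the same length, which `pd.DataFrame` requires when it is given a dict of lists.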