I am using Beautiful Soup to parse a set of XML files and extract some information from them, like this:
import os
from glob import glob

from bs4 import BeautifulSoup

a_lis = []
for filepath in glob(os.path.join('../data/trainingFiles/', '*.xml')):
    with open(filepath) as f:
        content = f.read()
    results = BeautifulSoup(content, 'lxml')
    #print(results)
    for LabelInteractions in results.find_all("labelinteractions"):
        #print(LabelInteractions)
        for labelinteractions in LabelInteractions.findAll('labelinteraction'):
            print(labelinteractions)
Output:
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
....
<labelinteraction precipitant="riociguat" precipitantcode="N0000188995" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>
How can I convert these XML attributes into a pandas DataFrame whose columns look like this?
precipitant precipitantcode type effect
Answer 0 (score: 2)
You can collect each column's values in a list and then create the DataFrame:
from collections import defaultdict
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup("""
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
<LabelInteraction type="Pharmacodynamic interaction" precipitant="alcohol" precipitantCode="N0000007432" effect=" 25064002: Headache (finding)"/>
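""", 'lxml')  # name the parser explicitly; lxml lowercases tag and attribute names
columns = ['precipitant', 'precipitantcode', 'type', 'effect']
d = defaultdict(list)
for labelinteraction in soup.findAll('labelinteraction'):
    for col in columns:
        # Use None when a tag does not carry the attribute (e.g. effect)
        d[col].append(labelinteraction[col] if labelinteraction.has_attr(col) else None)
df = pd.DataFrame(d)
print(df)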
Output:
     precipitant precipitantcode                         type                         effect
0      ritonavir     N0000007423      Unspecified interaction                           None
1  gc stimulator          NO MAP      Unspecified interaction                           None
2        alcohol     N0000007432  Pharmacodynamic interaction   25064002: Headache (finding)
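As a variation on the approach above, each tag's attrs dictionary can be handed to pandas directly, which leaves missing attributes as NaN. The following is only a minimal sketch: the sample_xml string is made up for illustration, and the html.parser backend is an assumption chosen to avoid the lxml dependency (it also lowercases attribute names). reindex is used only to fix the column order.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the tags shown in the question.
sample_xml = """
<labelinteractions>
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>
</labelinteractions>
"""

# Assumption: html.parser instead of lxml, so no extra install is needed.
soup = BeautifulSoup(sample_xml, "html.parser")

# Each Tag exposes its attributes as a mapping via .attrs;
# copy it into a plain dict so pandas gets one record per tag.
rows = [dict(tag.attrs) for tag in soup.find_all("labelinteraction")]

# Attributes missing from a tag become NaN; reindex fixes the column order.
columns = ["precipitant", "precipitantcode", "type", "effect"]
df_attrs = pd.DataFrame(rows).reindex(columns=columns)
print(df_attrs)
```

This keeps every attribute that appears on any tag, so it may be less suitable if only a fixed subset of columns is wanted.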
Answer 1 (score: 1)
If you have a list of the columns you want:
cols = ['precipitant', 'precipitantcode', 'type']
then you can iterate over them and append each value to a list in a dictionary:
d = {}
for labelinteractions in LabelInteractions.findAll('labelinteraction'):
    for c in cols:
        # Start a new list the first time a column is seen, otherwise append
        if c not in d:
            d[c] = [labelinteractions[c]]
        else:
            d[c].append(labelinteractions[c])
Once that is done, you can build the DataFrame:
df = pd.DataFrame(d)
This is what I get for your sample:
     precipitant precipitantcode                         type
0      ritonavir     N0000007423      Unspecified interaction
1  gc stimulator          NO MAP      Unspecified interaction
2      riociguat     N0000188995      Unspecified interaction
3        alcohol     N0000007432  Pharmacodynamic interaction
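One caveat with the loop above: indexing a tag with labelinteractions[c] raises a KeyError when the attribute is absent, which would happen if 'effect' were added to cols, since most tags in the sample lack it. A possible workaround is Tag.get, which returns None instead. The sketch below is not part of the original answer: interaction_rows, doc_a and doc_b are hypothetical names, the two strings stand in for the contents of two files, and html.parser is assumed in place of lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup


def interaction_rows(markup, cols):
    """Return one dict per <labelinteraction> tag (hypothetical helper).

    Tag.get returns None for a missing attribute instead of raising KeyError.
    """
    soup = BeautifulSoup(markup, "html.parser")  # assumption: stdlib parser
    return [
        {c: tag.get(c) for c in cols}
        for tag in soup.find_all("labelinteraction")
    ]


# Hypothetical stand-ins for the contents of two XML files.
doc_a = """
<labelinteractions>
<labelinteraction precipitant="ritonavir" precipitantcode="N0000007423" type="Unspecified interaction"></labelinteraction>
<labelinteraction precipitant="gc stimulator" precipitantcode="NO MAP" type="Unspecified interaction"></labelinteraction>
</labelinteractions>
"""
doc_b = """
<labelinteractions>
<labelinteraction effect=" 25064002: Headache (finding)" precipitant="alcohol" precipitantcode="N0000007432" type="Pharmacodynamic interaction"></labelinteraction>
</labelinteractions>
"""

cols = ["precipitant", "precipitantcode", "type", "effect"]

# Collect the rows from every document first, then build a single DataFrame.
rows = []
for markup in (doc_a, doc_b):
    rows.extend(interaction_rows(markup, cols))

df_all = pd.DataFrame(rows, columns=cols)
print(df_all)
```

Collecting plain dicts and constructing the DataFrame once at the end avoids growing a DataFrame inside the loop, and the same pattern should carry over to the per-file loop in the question.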