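I have been trying to parse an XML file (a minimal reproducible example is pasted below) and, for each employer code, collect the values of the YEAR and TOTAL fields from all of its INCOME tags. The output below should give a better idea of what I am after.
What I want to get: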
{"1234": [["2006", "12085"], ["2005","23071"], ["2004","21364"]],
"5678" : [["2015", "12345"],["2014", "13071"]]}
I have tried iterating over the file in several ways using ElementTree and/or BeautifulSoup, but this is where I end up.
What I actually get:
[["2006", "12085"], ["2005","23071"], ["2004","21364"], ["2015", "12345"], ["2014", "13071"]]
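I cannot group the income entries by their employer code.
PS: I am new to posting questions on Stack Overflow. I hope I have followed all the community guidelines.
This is the XML that is giving me trouble: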
<DETAILS>
<RESPONSE>
<EMPLOYER>
<EMPLOYERCODE>1234</EMPLOYERCODE>
<NAME1>ABC Service Corporation</NAME1>
</EMPLOYER>
<INCOME>
<YEAR>2006</YEAR>
<TOTAL>12085</TOTAL>
</INCOME>
<INCOME>
<YEAR>2005</YEAR>
<TOTAL>23071</TOTAL>
</INCOME>
<INCOME>
<YEAR>2004</YEAR>
<TOTAL>21364</TOTAL>
</INCOME>
<ID>18700763721</ID>
</RESPONSE>
<RESPONSE>
<EMPLOYER>
<EMPLOYERCODE>5678</EMPLOYERCODE>
<NAME1>DEF Service Corporation</NAME1>
</EMPLOYER>
<INCOME>
<YEAR>2015</YEAR>
<TOTAL>12345</TOTAL>
</INCOME>
<INCOME>
<YEAR>2014</YEAR>
<TOTAL>13071.73</TOTAL>
</INCOME>
<ID>18700763721</ID>
</RESPONSE>
</DETAILS>
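Answer 0 (score: 1)
Iterate over the RESPONSE elements first, because each one contains both the employer code and the income entries. After that, it is just a matter of associating each employer with its incomes.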
xml = '''
<DETAILS>
<RESPONSE>
<EMPLOYER>
<EMPLOYERCODE>1234</EMPLOYERCODE>
<NAME1>ABC Service Corporation</NAME1>
</EMPLOYER>
<INCOME>
<YEAR>2006</YEAR>
<TOTAL>12085</TOTAL>
</INCOME>
...
</RESPONSE>
<RESPONSE>
...
</RESPONSE>
</DETAILS>
'''
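from bs4 import BeautifulSoup

# html.parser lowercases tag names, so the selectors below are lowercase
soup = BeautifulSoup(xml, 'html.parser')
employers = {}
for res in soup.select('response'):
    emp_code = res.select_one('employercode').text
    incomes = []
    # collect one [year, total] pair per <INCOME> inside this <RESPONSE>
    for income in res.select('income'):
        year = income.select_one('year').text
        total = income.select_one('total').text
        incomes.append([year, total])
    employers[emp_code] = incomes
print(employers)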
Output:
{'1234': [['2006', '12085'], ['2005', '23071'], ['2004', '21364']], '5678': [['2015', '12345'], ['2014', '13071.73']]}
Answer 1 (score: 0)
Here is an ElementTree version for this problem:
import xml.etree.ElementTree as ET

tree = ET.parse('_filename_.xml')
root = tree.getroot()
dic = {}
for child in root:            # each <RESPONSE>
    for schild in child:      # direct children of that <RESPONSE>
        if schild.tag == 'EMPLOYER':
            # <EMPLOYERCODE> is the first child of <EMPLOYER>
            emp = schild[0].text
            dic[emp] = []
        elif schild.tag == 'INCOME':
            # <YEAR> and <TOTAL> are the first two children of <INCOME>
            dic[emp].append([schild[0].text, schild[1].text])
print(dic)
Output:
{'1234': [['2006', '12085'], ['2005', '23071'], ['2004', '21364']], '5678': [['2015', '12345'], ['2014', '13071.73']]}
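The loop above relies on child positions (schild[0], schild[1]) and on EMPLOYER appearing before INCOME. As a hedged alternative, here is a minimal sketch that looks elements up by tag name with findall() and findtext() instead. The function name incomes_by_employer and the shortened SAMPLE_XML string are made up for illustration; the sketch assumes the same structure as the XML in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (illustrative sample data only).
SAMPLE_XML = """
<DETAILS>
  <RESPONSE>
    <EMPLOYER>
      <EMPLOYERCODE>1234</EMPLOYERCODE>
      <NAME1>ABC Service Corporation</NAME1>
    </EMPLOYER>
    <INCOME><YEAR>2006</YEAR><TOTAL>12085</TOTAL></INCOME>
    <INCOME><YEAR>2005</YEAR><TOTAL>23071</TOTAL></INCOME>
  </RESPONSE>
  <RESPONSE>
    <EMPLOYER>
      <EMPLOYERCODE>5678</EMPLOYERCODE>
      <NAME1>DEF Service Corporation</NAME1>
    </EMPLOYER>
    <INCOME><YEAR>2015</YEAR><TOTAL>12345</TOTAL></INCOME>
    <INCOME><YEAR>2014</YEAR><TOTAL>13071.73</TOTAL></INCOME>
  </RESPONSE>
</DETAILS>
"""


def incomes_by_employer(xml_text):
    """Map each EMPLOYERCODE to a list of [YEAR, TOTAL] string pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for response in root.findall('RESPONSE'):
        # Look up the code by path rather than by child position
        code = response.findtext('EMPLOYER/EMPLOYERCODE')
        pairs = [[inc.findtext('YEAR'), inc.findtext('TOTAL')]
                 for inc in response.findall('INCOME')]
        # setdefault keeps earlier pairs if the same code shows up again
        result.setdefault(code, []).extend(pairs)
    return result


grouped = incomes_by_employer(SAMPLE_XML)
print(grouped)
```

Because the lookups are by name, extra or reordered child elements inside EMPLOYER or INCOME should not change the result, which is the main reason to prefer this over positional indexing.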
Answer 2 (score: 0)
You can use a dict comprehension together with BeautifulSoup's get_text() method and then split the resulting string. The result is very short code:
from pprint import pprint
from bs4 import BeautifulSoup

data = '''<DETAILS>
... your data ...
</DETAILS>'''

soup = BeautifulSoup(data, 'html.parser')
# get_text() joins YEAR and TOTAL with '|'; split('|') separates them again
data = {
    response.select_one('employercode').text: [
        i.get_text(strip=True, separator='|').split('|')
        for i in response.select('income')
    ]
    for response in soup.select('response')
}
pprint(data)
Prints:
{'1234': [['2006', '12085'], ['2005', '23071'], ['2004', '21364']],
'5678': [['2015', '12345'], ['2014', '13071.73']]}