Processing XML tags and extracting the corresponding tag contents

Time: 2019-05-21 12:07:31

Tags: python html xml python-3.x

The XML file being processed has the following content:

<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<author>Dimiter Vakarelov</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
...
</dblp>

As shown above, I want to iterate over each "incollection" tag, extract the contents of its "author" tags and of its "booktitle" tag, and pair every author with the booktitle of the same entry.

My code:

soup = BeautifulSoup(str(getfile()), 'lxml')
res = soup.find_all('incollection')
list = []
list1 = []

for each in res:
    for child in each.children:
        if child.name == 'author':
            list.append(child.text)

        if child.name == 'booktitle':
            list1.append(child.text)
            elem_dic = tuple(zip(list, list1))

My result is:

('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')

The desired result is:

('Philippe Balbiani', 'Handbook of Spatial Logics')
('Valentin Goranko', 'Handbook of Spatial Logics')
('Ruaan Kellerman', 'Handbook of Spatial Logics')
('Dimiter Vakarelov', 'Handbook of Spatial Logics')
('Jochen Renz', 'Handbook of AI')
('Bernhard Nebel', 'Handbook of AI')

How can I modify it to get the expected result?

1 answer:

Answer 0: (score: 0)

Modify your code as follows:

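from bs4 import BeautifulSoup

soup = BeautifulSoup(str(getfile()), 'lxml')
res = soup.find_all('incollection')
pairs = []

for each in res:
    authors = []      # reset for every <incollection>, so authors of different entries do not mix
    booktitle = None
    for child in each.children:
        if child.name == 'author':
            authors.append(child.text)
        elif child.name == 'booktitle':  # a child is either 'author' or 'booktitle', so use 'elif'
            booktitle = child.text
    # zip() on two flat lists would pair the i-th author with the i-th title;
    # instead, pair every author of this entry with the entry's own booktitle
    pairs.extend((author, booktitle) for author in authors)

elem_dic = tuple(pairs)  # build the tuple once, after the loop, not on every iteration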
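
For comparison, here is a minimal sketch of the same per-entry pairing using only the standard library's xml.etree.ElementTree, assuming the input is well-formed XML shaped like the sample in the question. The names SAMPLE_XML and author_title_pairs are hypothetical, introduced only for illustration; they stand in for whatever getfile() returns in the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample mirroring the structure shown in the question.
SAMPLE_XML = """<dblp>
<incollection>
<author>Philippe Balbiani</author>
<author>Valentin Goranko</author>
<author>Ruaan Kellerman</author>
<author>Dimiter Vakarelov</author>
<booktitle>Handbook of Spatial Logics</booktitle>
</incollection>
<incollection>
<author>Jochen Renz</author>
<author>Bernhard Nebel</author>
<booktitle>Handbook of AI</booktitle>
</incollection>
</dblp>"""


def author_title_pairs(xml_text):
    """Return one (author, booktitle) tuple per author of each <incollection>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.iter('incollection'):
        # findtext returns None if the entry has no <booktitle> child
        title = entry.findtext('booktitle')
        for author in entry.findall('author'):
            pairs.append((author.text, title))
    return pairs


for pair in author_title_pairs(SAMPLE_XML):
    print(pair)
```

Looking up the booktitle once per entry, rather than collecting all authors and all titles into two flat lists, is what keeps each author attached to the right book when entries have different numbers of authors.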