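Adding missing child tags with Beautiful Soup in Python

Time: 2017-08-11 12:40:18

Tags: python xml python-2.7 python-3.x beautifulsoup

I need to extract people's names from the XML below. I used the code below and got the output shown (ORIGINAL), but I also want the entries that have missing elements.

The fourth person has no middle name, so only three names were extracted.

Sample XML: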

<author>
    <persName>
        <forename>Esayas</forename>
        <middlename>K</middlename>
        <surname>Gudina</surname>
        <lb/>
        <marker>1*</marker>,
    </persName>
    <persName>
        <forename>Solomon</forename>
        <middlename>T</middlename>
        <surname>Amade</surname>
        <lb/>
        <marker>2</marker> ,
    </persName>
    <persName>
        <forename>Fessahaye</forename>
        <middlename>A</middlename>
        <surname>Tesfamichael</surname>
        <lb/>
        <marker>3</marker> and
    </persName>
    <persName>
        <forename>Rana</forename>
        <surname>Ram</surname>
        <lb/>
        <marker>4</marker>
    </persName>
</author>

Code:

from bs4 import BeautifulSoup as bs
import codecs

name = []

with codecs.open("D:/...../2F1472-6823-11-19.authors.tei.xml", "r", "utf-8") as infile:
    soup = bs(infile, "html5lib")      

pn = soup.find_all('persname')


for i in pn:
    try:
        if len((i.find('forename')).text) != 0:
            fn = (i.find('forename')).text
        else:
            fn =""
        if len((i.find('middlename')).text) != 0:
            mn = (i.find('middlename')).text
        else:
            mn=""
        if len((i.find('surname')).text) != 0:
            sn = (i.find('surname')).text
        else:
            sn ="" 
        name.append(fn+" "+mn+" "+sn)
    except:
        print ("")
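
A likely reason the loop above skips the fourth person: `find()` returns `None` when the tag is absent, so `.text` raises `AttributeError`, and the bare `except` swallows it before `name.append(...)` runs. The following is a minimal sketch of that failure, assuming a trimmed copy of the fourth `<persName>` and the built-in `"html.parser"` (an assumption made here to avoid extra dependencies; the question uses `"html5lib"`):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the fourth <persName> from the sample: it has no <middlename>.
xml = """
<persName>
    <forename>Rana</forename>
    <surname>Ram</surname>
</persName>
"""

# Assumption: "html.parser" instead of "html5lib". Both lowercase tag names,
# which is why the lookup uses 'persname'.
soup = BeautifulSoup(xml, "html.parser")
person = soup.find("persname")

# find() returns None when no matching tag exists.
missing = person.find("middlename")
print(missing)

# Accessing .text on None raises AttributeError; a bare `except:` hides this.
error = None
try:
    missing.text
except AttributeError as exc:
    error = exc
    print(type(exc).__name__)
```

Catching only the exception you expect, or checking for `None` before reading `.text`, makes this kind of silent skip much easier to spot.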

Output:

INDEX   TYPE       SIZE   VALUE
0     unicode       1     Esayas K Gudina
1     unicode       1     Solomon T Amade
2     unicode       1     Fessahaye A Tesfamichael

Expected output:

INDEX   TYPE       SIZE   VALUE
0     unicode       1     Esayas K Gudina
1     unicode       1     Solomon T Amade
2     unicode       1     Fessahaye A Tesfamichael
3     unicode       1     Rana   Ram
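
One possible way to get an entry for every person, including those without a middle name, is to treat an absent tag as an empty string. This is only a sketch under stated assumptions: the helper `child_text` is a hypothetical name introduced for illustration, the XML is an abbreviated inline copy of the sample (first and fourth person), and `"html.parser"` stands in for `"html5lib"`.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the sample: one person with a middle name, one without.
xml = """
<author>
    <persName>
        <forename>Esayas</forename>
        <middlename>K</middlename>
        <surname>Gudina</surname>
    </persName>
    <persName>
        <forename>Rana</forename>
        <surname>Ram</surname>
    </persName>
</author>
"""

# Assumption: "html.parser" instead of "html5lib"; tag names are lowercased.
soup = BeautifulSoup(xml, "html.parser")


def child_text(parent, tag_name):
    """Hypothetical helper: text of the first matching child, or "" if absent."""
    child = parent.find(tag_name)
    return child.get_text(strip=True) if child is not None else ""


names = []
for person in soup.find_all("persname"):
    parts = [child_text(person, tag) for tag in ("forename", "middlename", "surname")]
    # Joining all three parts keeps an empty slot for a missing middle name,
    # mirroring the fn + " " + mn + " " + sn concatenation in the question.
    names.append(" ".join(parts))

for full_name in names:
    print(full_name)
```

If the empty slot is not wanted, joining only the non-empty parts, for example `" ".join(p for p in parts if p)`, would collapse the extra space.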

0 answers:

No answers yet.