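I need to extract people's names from the XML below. I used the code below and got the output shown (ORIGINAL), but I also want the entries where an element is missing.
The fourth person has no middle name, so that entry was skipped and only three names were extracted.
Sample XML: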
<author>
<persName>
<forename>Esayas</forename>
<middlename>K</middlename>
<surname>Gudina</surname>
<lb/>
<marker>1*</marker>,
</persName>
<persName>
<forename>Solomon</forename>
<middlename>T</middlename>
<surname>Amade</surname>
<lb/>
<marker>2</marker> ,
</persName>
<persName>
<forename>Fessahaye</forename>
<middlename>A</middlename>
<surname>Tesfamichael</surname>
<lb/>
<marker>3</marker> and
</persName>
<persName>
<forename>Rana</forename>
<surname>Ram</surname>
<lb/>
<marker>4</marker>
</persName>
</author>
Code:
from bs4 import BeautifulSoup as bs
import codecs
name = []
with codecs.open("D:/...../2F1472-6823-11-19.authors.tei.xml", "r", "utf-8") as infile:
    soup = bs(infile, "html5lib")
pn = soup.find_all('persname')
for i in pn:
    try:
        if len((i.find('forename')).text) != 0:
            fn = (i.find('forename')).text
        else:
            fn = ""
        if len((i.find('middlename')).text) != 0:
            mn = (i.find('middlename')).text
        else:
            mn = ""
        if len((i.find('surname')).text) != 0:
            sn = (i.find('surname')).text
        else:
            sn = ""
        name.append(fn+" "+mn+" "+sn)
    except:
        print ("")
Output:
INDEX TYPE SIZE VALUE
0 unicode 1 Esayas K Gudina
1 unicode 1 Solomon T Amade
2 unicode 1 Fessahaye A Tesfamichael
Expected output:
INDEX TYPE SIZE VALUE
0 unicode 1 Esayas K Gudina
1 unicode 1 Solomon T Amade
2 unicode 1 Fessahaye A Tesfamichael
3 unicode 1 Rana Ram
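For reference, a minimal sketch of the behaviour I am after. In the code above, i.find('middlename') returns None when the element is absent, so .text raises AttributeError and the bare except silently drops the whole person. The sketch below treats a missing element as an empty string instead. Note that it swaps in the standard-library xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra packages; SAMPLE and extract_names are hypothetical names used only for illustration, and SAMPLE is a shortened copy of the XML above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample XML: one person with a middle name, one without.
SAMPLE = """<author>
<persName>
<forename>Esayas</forename>
<middlename>K</middlename>
<surname>Gudina</surname>
<lb/>
<marker>1*</marker>,
</persName>
<persName>
<forename>Rana</forename>
<surname>Ram</surname>
<lb/>
<marker>4</marker>
</persName>
</author>"""


def extract_names(xml_text):
    """Return one 'forename middlename surname' string per persName element."""
    root = ET.fromstring(xml_text)
    names = []
    for person in root.iter("persName"):
        parts = []
        for tag in ("forename", "middlename", "surname"):
            # findtext returns the default ("") when the child element is absent,
            # so a missing middlename no longer raises an exception.
            text = (person.findtext(tag, default="") or "").strip()
            if text:
                parts.append(text)
        # Joining only the non-empty parts avoids a double space for "Rana Ram".
        names.append(" ".join(parts))
    return names


names = extract_names(SAMPLE)
print(names)
```

With BeautifulSoup the same idea would presumably be to check whether find() returned None before reading .text, instead of relying on the try/except.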