Extracting content by XML tag using Beautiful Soup

Time: 2017-08-08 10:12:54

Tags: python xml python-2.7 beautifulsoup python-requests

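I have several XML files like the one below. I want to use Beautiful Soup in Python to extract their content as a data frame, in the form of the expected output shown below. Please help.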

Sample XML:

<Author AffiliationIDS="Aff1 Aff2" CorrespondingAffiliationID="Aff1" ORCID="http://orcid.org/0000-0003-4649-327X">
    <AuthorName DisplayOrder="Western">
        <GivenName>Anouk</GivenName>
        <GivenName>van der</GivenName>
        <FamilyName>Hoorn</FamilyName>
    </AuthorName>
    <Contact>
        <Phone>+31-50-3612400</Phone>
        <Fax>+31-50-3611707</Fax>
        <Email>a.van.der.hoorn@umcg.nl</Email>
    </Contact>
</Author>
<Author AffiliationIDS="Aff1">
 <AuthorName DisplayOrder="Western">
    <GivenName>Kamal</GivenName>
    <GivenName>M.</GivenName>
    <FamilyName>Aden</FamilyName>
 </AuthorName>
</Author>
<Author AffiliationIDS="Aff1 Aff2">
 <AuthorName DisplayOrder="Western">
    <GivenName>Peter</GivenName>
    <GivenName>Jan</GivenName>
    <FamilyName>van Laar</FamilyName>
 </AuthorName>
</Author>

Expected output:

Anouk van der Hoorn         AuthorName
Kamal M. Aden               AuthorName
Peter Jan van Laar          AuthorName

1 answer:

Answer 0 (score: 1)

Here is the code, just a few lines:

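from bs4 import BeautifulSoup

# Read the whole XML file into memory
with open("sample.xml", "r") as f:
    content = f.read()

# The "lxml" HTML parser lowercases tag names, hence "author" and "authorname" below
soup = BeautifulSoup(content, "lxml")
# Take the text of each author's name element, replacing newlines with spaces
authornames = [author.find("authorname").text.replace("\n", " ") for author in soup.find_all("author")]
print(authornames)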

Output (under Python 2.7, hence the u'' prefixes):

[u' Anouk van der Hoorn ', u' Kamal M. Aden ', u' Peter Jan van Laar ']
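
The list above still carries leading and trailing spaces and is not yet in the two-column shape the question asks for. The following is a minimal sketch of one way to get closer, not the answerer's own method. It makes several assumptions: the helper name extract_author_rows is hypothetical, two authors from the question's sample are inlined in place of sample.xml, Python's built-in "html.parser" is swapped in for "lxml" (it also lowercases tag names), and the constant label "AuthorName" is taken from the expected output in the question.

```python
from bs4 import BeautifulSoup

# Two authors from the question's sample, inlined so the sketch is self-contained.
SAMPLE_XML = """
<Author AffiliationIDS="Aff1 Aff2">
    <AuthorName DisplayOrder="Western">
        <GivenName>Anouk</GivenName>
        <GivenName>van der</GivenName>
        <FamilyName>Hoorn</FamilyName>
    </AuthorName>
</Author>
<Author AffiliationIDS="Aff1">
    <AuthorName DisplayOrder="Western">
        <GivenName>Kamal</GivenName>
        <GivenName>M.</GivenName>
        <FamilyName>Aden</FamilyName>
    </AuthorName>
</Author>
"""


def extract_author_rows(xml_text):
    """Return (full name, label) tuples, one per <Author> element (hypothetical helper)."""
    # Assumption: "html.parser" is used instead of "lxml"; both lowercase tag names.
    soup = BeautifulSoup(xml_text, "html.parser")
    rows = []
    for author in soup.find_all("author"):
        name_tag = author.find("authorname")
        if name_tag is None:
            continue  # skip authors that have no name element
        # get_text(" ") joins the text pieces with spaces; split()/join collapses
        # runs of whitespace and trims both ends.
        full_name = " ".join(name_tag.get_text(" ").split())
        rows.append((full_name, "AuthorName"))
    return rows


rows = extract_author_rows(SAMPLE_XML)
for full_name, label in rows:
    print(full_name, label)
```

If pandas is available, a list of tuples like this can be passed to pandas.DataFrame(rows, columns=["name", "tag"]) to obtain a data frame; the column names here are illustrative only.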