Question

I have a bunch of XML files (about 74k), and they all have this structure:

<?xml version="1.0" encoding="UTF-8"?><article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
<title>Systematic review</title>
<fulltext>...</fulltext>
<figures>
<figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
<figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
</figures>
</article>

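I would like to associate the pmcid attribute (unique to each file) with the iri attributes of the figures that file contains, as a list, so that I can build a numpy array or even just an easy-to-use file from them.

For example, for this article the line should be:

2653499 1472-6963-9-38-2 1472-6963-9-38-1

I tried using XSLT without any results... I would appreciate any help.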

Answer 1

Here is an option using xml.etree.ElementTree from the standard library:

import xml.etree.ElementTree as ET

data = """<?xml version="1.0" encoding="UTF-8"?>
<article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
    <title>Systematic review</title>
    <fulltext>...</fulltext>
    <figures>
        <figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
        <figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
    </figures>
</article>
"""

article = ET.fromstring(data)

pmcid = article.attrib.get('pmcid')
for figure in article.findall('figures/figure'):
    iri = figure.attrib.get('iri')
    print(pmcid, iri)

Prints:

2653499 1472-6963-9-38-2
2653499 1472-6963-9-38-1
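
To cover the whole collection rather than a single string, the same lookup can be wrapped in a loop over files. This is only a minimal sketch: it assumes the files sit in one directory and end in .xml, and the helper names article_row, collect_rows and write_rows are made up for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET


def article_row(path):
    """Return [pmcid, iri, iri, ...] for one article file."""
    root = ET.parse(path).getroot()  # the root element is <article>
    # Assumes every article carries a pmcid attribute, as in the question.
    row = [root.get('pmcid')]
    row.extend(fig.get('iri') for fig in root.findall('figures/figure'))
    return row


def collect_rows(directory):
    """Build one row per *.xml file in `directory` (assumed layout)."""
    pattern = os.path.join(directory, '*.xml')
    return [article_row(path) for path in sorted(glob.glob(pattern))]


def write_rows(rows, out_path):
    """Write rows as space-separated lines, one article per line."""
    with open(out_path, 'w') as out:
        for row in rows:
            out.write(' '.join(row) + '\n')
```

Calling write_rows(collect_rows(some_directory), some_output_path) would then produce one line per article in the format asked for in the question.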

Answer 2

How about using BeautifulSoup?

from bs4 import BeautifulSoup

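with open('file.xml') as f:
    soup = BeautifulSoup(f, 'xml')  # the 'xml' parser requires lxml to be installed

pmcid = soup.find('article')['pmcid']
figures = soup.find_all('figure')

# print the pmcid followed by every iri on a single line
print(pmcid, *(figure['iri'] for figure in figures))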

This prints exactly the line from your example:

2653499 1472-6963-9-38-2 1472-6963-9-38-1
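
Because each article has a different number of figures, lines like the one above are ragged, which is awkward for a rectangular array. One option is to store one (pmcid, iri) pair per row instead. A rough sketch, assuming the space-separated lines are already available as strings; lines_to_pairs and pairs_to_csv are hypothetical helper names.

```python
import csv
import io


def lines_to_pairs(lines):
    """Turn 'pmcid iri1 iri2 ...' lines into (pmcid, iri) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pmcid, iris = fields[0], fields[1:]
        pairs.extend((pmcid, iri) for iri in iris)
    return pairs


def pairs_to_csv(pairs):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['pmcid', 'iri'])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = lines_to_pairs(['2653499 1472-6963-9-38-2 1472-6963-9-38-1'])
print(pairs_to_csv(pairs))
```

A two-column file like this has the same width on every row, so it should be straightforward to load into an array or a table later.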

Answer 3

out.xsl:

<!-- http://www.w3.org/TR/xslt#copying -->
<!-- http://www.dpawson.co.uk/xsl/sect2/identity.html#d5917e43 -->
<!-- The Identity Transformation -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" version="1.0" encoding="UTF-8"/>

    <!-- Whenever you match any node or any attribute -->
    <xsl:template match="@*|node()">
        <!-- Copy the current node -->
        <xsl:copy>
            <!-- Including any attributes it has and any child nodes -->
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="article">
        <xsl:value-of select="@pmcid"/>
        <xsl:apply-templates select="figures/figure"/>
        <xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="figure">
        <xsl:text> </xsl:text><xsl:value-of select="@iri"/>
    </xsl:template>
</xsl:stylesheet>

Run it:

$ xsltproc out.xsl in.xml
2653499 1472-6963-9-38-2 1472-6963-9-38-1

Answer 4

You could try xmllint.

xmllint --shell myxml <<< `echo 'cat /article/@pmcid|//figures/figure/@*'`
/ >  -------
 pmcid="2653499"
 -------
 iri="1472-6963-9-38-2"
 -------
 iri="1472-6963-9-38-1"
/ >

Then pipe it to awk to get the desired output...

xmllint --shell myxml <<< `echo 'cat /article/@pmcid|//figures/figure/@*'` | 
awk -F'[="]' -v ORS=" " 'NF>1{print $3}'
2653499 1472-6963-9-38-2 1472-6963-9-38-1
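
If the rest of the pipeline is in Python, the awk step can be mirrored there. This is a sketch that assumes the xmllint --shell output has exactly the shape shown above (one name="value" attribute per line); attribute_values is an illustrative name, and the sample text is copied from the output above.

```python
import re


def attribute_values(xmllint_output):
    """Extract the quoted attribute values, in order of appearance."""
    return re.findall(r'\w+="([^"]*)"', xmllint_output)


sample = '''/ >  -------
 pmcid="2653499"
 -------
 iri="1472-6963-9-38-2"
 -------
 iri="1472-6963-9-38-1"
/ >'''

print(' '.join(attribute_values(sample)))
```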

Answer 5

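(A)

Well, since you said any help... here is my shot:

In my experience, you will be much happier once you get comfortable with

obj.__dict__

and see how each XML element fits together. That way you can effectively spell-check the whole XML file by running iterative tests on it (below).

I put your sample data in an .xml file and loaded it in a Python IDE (2.7.xxx). Here is how I worked out the code to use: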

>>> import xml.etree.ElementTree as ET
>>> some_tree = ET.parse("/Users/pro/Desktop/tech/test_scripts/test.xml")
>>> for block_number in range(0, len(some_tree._root.getchildren())):
    print "block_number: " + str(block_number)


block_number: 0
block_number: 1
block_number: 2
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root.__dict__
{'text': '\n', 'attrib': {'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}, 'tag': 'article', '_children': [<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]}
>>> some_tree._root.attrib
{'pmid': '19243591', 'doi': '10.1186/1472-6963-9-38', 'pmcid': '2653499'}
>>> some_tree._root.attrib['pmid']
'19243591'
>>> to_store = {}
>>> to_store[some_tree._root.attrib['pmid']] = []
>>> some_tree._root.getchildren()
[<Element 'title' at 0x101a59450>, <Element 'fulltext' at 0x101a59550>, <Element 'figures' at 0x101a59410>]
>>> some_tree._root[2]
<Element 'figures' at 0x101a59410>
>>> some_tree._root[2].__dict__
{'text': '\n', 'attrib': {}, 'tag': 'figures', 'tail': '\n', '_children': [<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]}
>>> some_tree._root[2].getchildren()
[<Element 'figure' at 0x101a595d0>, <Element 'figure' at 0x101a59650>]
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    print some_tree._root[2].getchildren()[r]


<Element 'figure' at 0x101a595d0>
<Element 'figure' at 0x101a59650>
>>> some_tree._root[2].getchildren()[1].__dict__
{'attrib': {'iri': '1472-6963-9-38-1'}, 'tag': 'figure', 'tail': '\n', '_children': [<Element 'caption' at 0x101a59690>]}
>>> for r in range(0, len(some_tree._root[2].getchildren())):
    to_store[to_store.keys()[0]].append(some_tree._root[2].getchildren()[r].attrib['iri'])


>>> to_store
{'19243591': ['1472-6963-9-38-2', '1472-6963-9-38-1']}
>>>

Note that to_store is arbitrary; it is just a convenient place to store those x, y data.
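
The session above is Python 2.7 and reaches into the private _root attribute; getchildren() was also removed in Python 3.9. A rough Python 3 equivalent using only the public ElementTree API might look like the following. It parses the sample from a string instead of a file and keys the dictionary by pmcid (the session used pmid), both of which are choices made for this sketch.

```python
import xml.etree.ElementTree as ET

data = """<article pmcid="2653499" pmid="19243591" doi="10.1186/1472-6963-9-38">
<title>Systematic review</title>
<fulltext>...</fulltext>
<figures>
<figure iri="1472-6963-9-38-2"><caption>...</caption></figure>
<figure iri="1472-6963-9-38-1"><caption>...</caption></figure>
</figures>
</article>"""

root = ET.fromstring(data)

# list(element) gives the direct children, replacing getchildren()
children = list(root)
print([child.tag for child in children])

# Keyed by pmcid here, since that is what the question asks for
to_store = {}
to_store[root.attrib['pmcid']] = [fig.attrib['iri'] for fig in root.iter('figure')]
print(to_store)
```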

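(B)

I really like writing output to my own sqlite flat-file db. I did this with a translation of the whole Bible, for use at runtime in an iOS app I published. Here is some sample code for the sql: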

import sqlite3
bible_books = ["genesis", "exodus", "leviticus", "numbers", "deuteronomy",
           "joshua", "judges", "ruth", "1 samuel", "2 samuel", "1 kings",
           "2 kings", "1 chronicles", "2 chronicles", "ezra", "nehemiah",
           "esther", "job", "psalms", "proverbs", "ecclesiastes",
           "song of solomon", "isaiah", "jeremiah", "lamentations",
           "ezekiel", "daniel", "hosea", "joel", "amos", "obadiah",
           "jonah", "micah", "nahum", "habakkuk", "zephaniah", "haggai",
           "zechariah", "malachi", "matthew", "mark", "luke", "john",
           "acts", "romans", "1 corinthians", "2 corinthians",
           "galatians", "ephesians", "philippians", "colossians",
           "1 thessalonians", "2 thessalonians", "1 timothy",
           "2 timothy", "titus", "philemon", "hebrews", "james",
           "1 peter", "2 peter", "1 john", "2 john", "3 john",
           "jude", "revelation"]
chapter_counts = {bible_books[0]:50, bible_books[1]:40, bible_books[2]:27,
          bible_books[3]:36, bible_books[4]:34, bible_books[5]:24,
          bible_books[6]:21, bible_books[7]:4, bible_books[8]:31,
          bible_books[9]:24, bible_books[10]:22, bible_books[11]:25,
          bible_books[12]:29, bible_books[13]:36, bible_books[14]:10,
          bible_books[15]:13, bible_books[16]:10, bible_books[17]:42,
          bible_books[18]:150, bible_books[19]:31, bible_books[20]:12,
          bible_books[21]:8, bible_books[22]:66, bible_books[23]:52,
          bible_books[24]:5, bible_books[25]:48, bible_books[26]:12,
          bible_books[27]:14, bible_books[28]:3, bible_books[29]:9,
          bible_books[30]:1, bible_books[31]:4, bible_books[32]:7,
          bible_books[33]:3, bible_books[34]:3,
          bible_books[35]:3, bible_books[36]:2, bible_books[37]:14,
          bible_books[38]:4, bible_books[39]:28, bible_books[40]:16,
          bible_books[41]:24, bible_books[42]:21, bible_books[43]:28,
          bible_books[44]:16, bible_books[45]:16, bible_books[46]:13,
          bible_books[47]:6, bible_books[48]:6, bible_books[49]:4,
          bible_books[50]:4, bible_books[51]:5, bible_books[52]:3,
          bible_books[53]:6, bible_books[54]:4, bible_books[55]:3,
          bible_books[56]:1, bible_books[57]:13, bible_books[58]:5,
          bible_books[59]:5, bible_books[60]:3, bible_books[61]:5,
          bible_books[62]:1, bible_books[63]:1, bible_books[64]:1,
          bible_books[65]:22}

conn = sqlite3.connect("bible_web.sqlite3")
c = conn.cursor()



for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        c.execute("create table " + book_name + "_" + str(i_chapter) + " (verse real primary key, value text)")

for i_book in bible_books:
    book_name = "b_" + i_book.lower().replace(" ", "_")
    for i_chapter in range(1, chapter_counts[i_book]+1):
        #c.execute("SELECT Count(*) FROM " + book_name + "_" + str(i_chapter))
        #i_rows = int(c.fetchall())
        #for verse_number in range(1, i_rows+1):
        c.execute("update " + book_name + "_" + str(i_chapter) + " set value=trim(value)")

conn.commit()
c.close()
conn.close()
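
Applied to the question's data, the same idea could be a single table of (pmcid, iri) pairs rather than one table per group. This is only a sketch: the table name figure, its columns, and the in-memory database are assumptions made for illustration.

```python
import sqlite3

# Sample pairs taken from the article in the question
rows = [('2653499', '1472-6963-9-38-2'), ('2653499', '1472-6963-9-38-1')]

conn = sqlite3.connect(':memory:')  # pass a file path instead to persist the data
conn.execute(
    'CREATE TABLE figure ('
    'pmcid TEXT NOT NULL, iri TEXT NOT NULL, PRIMARY KEY (pmcid, iri))'
)
# Parameterized inserts avoid building SQL by string concatenation
conn.executemany('INSERT INTO figure (pmcid, iri) VALUES (?, ?)', rows)
conn.commit()

# Look up every figure of one article, in insertion order
cursor = conn.execute(
    'SELECT iri FROM figure WHERE pmcid = ? ORDER BY rowid', ('2653499',)
)
iris = [iri for (iri,) in cursor]
print(iris)
conn.close()
```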

Just some thoughts. Hope that helps.

How do I extract specific XML attribute values and list them?
