在python中使用Regex解析xml文件

时间:2015-03-16 12:45:12

标签: python regex xml xml-parsing

我有一个xml文件作为文本文件,如下所示: -

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2011-01-11" key="journals/acta/Simon83">
<author>Hans-Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>http://dx.doi.org/10.1007/BF01257084</ee>
</article>

如果我输入'Parallel',那么我应该获得整个标题名称,然后是'author','pages','year','volume','journal'

作为样本输出: -

Sanjeev Saxena
Parallel Integer Sorting and Simulation Amongst CRCW Models.
607-619
1996
33
Acta Inf.

如何使用正则表达式执行上述操作?请帮忙!

提前致谢!

2 个答案:

答案 0 :(得分:1)

解析xmlhtml doc的最佳方法是使用正确的html解析器,如beautifulsouplxml模块,但作为替代方法,您可以使用以下模式:

>>> s="""<?xml version="1.0" encoding="ISO-8859-1"?>
... <!DOCTYPE dblp SYSTEM "dblp.dtd">
... <dblp>
... <article mdate="2011-01-11" key="journals/acta/Saxena96">
... <author>Sanjeev Saxena</author>
... <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
... <pages>607-619</pages>
... <year>1996</year>
... <volume>33</volume>
... <journal>Acta Inf.</journal>
... <number>7</number>
... <url>db/journals/acta/acta33.html#Saxena96</url>
... <ee>http://dx.doi.org/10.1007/BF03036466</ee>
... </article>
... <article mdate="2011-01-11" key="journals/acta/Simon83">
... <author>Hans-Ulrich Simon</author>
... <title>Pattern Matching in Trees and Nets.</title>
... <pages>227-248</pages>
... <year>1983</year>
... <volume>20</volume>
... <journal>Acta Inf.</journal>
... <url>db/journals/acta/acta20.html#Simon83</url>
... <ee>http://dx.doi.org/10.1007/BF01257084</ee>
... </article>"""
>>> import re
>>> l=['author','pages','year','volume','journal']
>>> pat=r'|'.join(('<{}>(.*)</{}>'.format(i,i) for i in l))
>>> [j  for i in re.findall(pat,s) for j in i if j]
['Sanjeev Saxena', '607-619', '1996', '33', 'Acta Inf.', 'Hans-Ulrich Simon', '227-248', '1983', '20', 'Acta Inf.']

如果您想从输入中获取单词,则需要以下额外命令:

names=raw_input('enter the named (separate with space): ')
l=names.split()

答案 1 :(得分:0)

改为使用 XML Parser

使用lxml的工作示例:

import lxml.etree as ET

data = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
        <article mdate="2011-01-11" key="journals/acta/Saxena96">
                <author>Sanjeev Saxena</author>
                <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
                <pages>607-619</pages>
                <year>1996</year>
                <volume>33</volume>
                <journal>Acta Inf.</journal>
                <number>7</number>
                <url>db/journals/acta/acta33.html#Saxena96</url>
                <ee>http://dx.doi.org/10.1007/BF03036466</ee>
                </article>
                <article mdate="2011-01-11" key="journals/acta/Simon83">
                <author>Hans-Ulrich Simon</author>
                <title>Pattern Matching in Trees and Nets.</title>
                <pages>227-248</pages>
                <year>1983</year>
                <volume>20</volume>
                <journal>Acta Inf.</journal>
                <url>db/journals/acta/acta20.html#Simon83</url>
                <ee>http://dx.doi.org/10.1007/BF01257084</ee>
        </article>
</dblp>
"""

root = ET.fromstring(data)

title = 'Parallel'
article = root.xpath('.//article[starts-with(title, "%s")]' % title)[0]

for prop in ['author', 'pages', 'year', 'volume', 'journal']:
    print article.findtext(prop)

打印:

Sanjeev Saxena
607-619
1996
33
Acta Inf.