Reformat an XML file into a text file without markup

Asked: 2015-03-20 05:38:06

Tags: python regex xml

I have a text file:

<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>

I need to remove all the angle brackets and output each tag name joined to its content with a hyphen:

author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48

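3 answers:

Answer 0 (score: 2)

Instead of diving into the wonderful world of regular expressions, I would use a tool built for the job: a parser, such as lxml.

Working example: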

from lxml.html import fromstring

data = u"""
<article>
  <author>Frank Drewes</author>
  <author>Johanna H&ouml;gberg</author>
  <author>Andreas Maletti</author>
  <title>MAT learners for tree series: an abstract data type and two realizations.</title>
  <pages>165-189</pages>
  <year>2011</year>
  <volume>48</volume>
</article>
"""

root = fromstring(data)

# Each direct child of <article> becomes one "tag-text" line.
for element in root.iterchildren():
    print('%s-%s' % (element.tag, element.text_content()))

Prints:

author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
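
If lxml is not available, the same parser-based idea can be sketched with the standard library. This is only a rough Python 3 illustration, not part of the answer above: `TagTextCollector` and `xml_to_lines` are hypothetical names, and the sketch assumes every element of interest contains plain text only. `html.parser.HTMLParser` is used rather than a strict XML parser because `&ouml;` is an HTML entity that is not predefined in XML.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collect 'tag-text' lines for text-only elements."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &ouml; in text data.
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._tag = None      # innermost tag opened most recently
        self._chunks = []     # text pieces seen since that tag opened

    def handle_starttag(self, tag, attrs):
        self._tag = tag
        self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Emit a line only when the closing tag matches the last opened tag,
        # i.e. the element contained text and no child elements.
        if tag == self._tag:
            self.lines.append('%s-%s' % (tag, ''.join(self._chunks)))
        self._tag = None


def xml_to_lines(text):
    parser = TagTextCollector()
    parser.feed(text)
    parser.close()
    return parser.lines


sample = """<article>
  <author>Johanna H&ouml;gberg</author>
  <year>2011</year>
</article>
"""
print('\n'.join(xml_to_lines(sample)))
```

Note that `HTMLParser` lowercases tag names, so this sketch is only suitable when the tag names are already lowercase, as they are in the question.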

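Answer 1 (score: 1)

Please use alexce's approach if at all possible. (If it isn't, try to find a way to make it possible - see this answer for rationale.) I'm only adding this here for variety.

It uses re.match, a named group, and a backreference.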

import re

input_lines = '''<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>'''.splitlines()

out_lines = []
for line in input_lines:
    # The named group captures the tag name; (?P=tag) requires the same name in the closing tag.
    mat = re.match(r'<(?P<tag>[^>]+)>([^>]*)</(?P=tag)>', line)
    if mat:
        out_lines.append("%s-%s" % mat.groups())

print('\n'.join(out_lines))

Output:

author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
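
The regex only copies the text between the tags, so an entity such as `&ouml;` is passed through undecoded. If decoded text is wanted, one option in Python 3 is to run the captured text through `html.unescape` from the standard library. The following is a minimal sketch under that assumption; `LINE_RE` and `convert_line` are hypothetical names introduced here for illustration.

```python
import html
import re

# Same pattern as above, compiled once for reuse.
LINE_RE = re.compile(r'<(?P<tag>[^>]+)>([^>]*)</(?P=tag)>')


def convert_line(line):
    """Return 'tag-text' with entities decoded, or None if the line does not match."""
    mat = LINE_RE.match(line)
    if mat is None:
        return None
    tag, text = mat.groups()
    # html.unescape turns named and numeric character references into characters.
    return '%s-%s' % (tag, html.unescape(text))


print(convert_line('<author>Johanna H&ouml;gberg</author>'))
```

Lines that do not have the `<tag>text</tag>` shape, such as a lone `</article>`, return `None` and can simply be skipped by the caller.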

Answer 2 (score: 0)

You could try the following re.sub calls, but they won't handle nested tags.

>>> import re
>>> s = '''<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>'''
>>> m = re.sub(r'<(\w+)\b[^>]*>([^<]*)</\1>', r'\1-\2', s)
>>> print(re.sub(r'<[^<>]*>', '', m))
author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
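
To make the nested-tag caveat concrete, here is a small sketch that wraps the same two substitutions in a function and runs them on a flat element and on a nested one. The name `strip_tags` and the nested `<first>`/`<last>` input are made up for illustration and do not come from the question's data.

```python
import re


def strip_tags(text):
    # Step 1: rewrite every <tag>text</tag> pair whose content has no child tags.
    joined = re.sub(r'<(\w+)\b[^>]*>([^<]*)</\1>', r'\1-\2', text)
    # Step 2: drop whatever tags are left over.
    return re.sub(r'<[^<>]*>', '', joined)


flat = '<year>2011</year>'
nested = '<author><first>Frank</first> <last>Drewes</last></author>'

print(strip_tags(flat))    # year-2011
print(strip_tags(nested))  # first-Frank last-Drewes
```

In the nested case the inner elements are converted, but the enclosing `author` name is silently lost in the second step, which is the kind of input where a parser is the safer choice.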