Question

I have a text file:

<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>

I need to remove all the angle brackets and print each tag name joined to its content with a hyphen:

author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48

Answer 1

Rather than diving into the wonderful world of regular expressions, I would use a dedicated tool: a parser such as lxml.

Working example:

from lxml.html import fromstring

data = u"""
<article>
  <author>Frank Drewes</author>
  <author>Johanna H&ouml;gberg</author>
  <author>Andreas Maletti</author>
  <title>MAT learners for tree series: an abstract data type and two realizations.</title>
  <pages>165-189</pages>
  <year>2011</year>
  <volume>48</volume>
</article>
"""

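# lxml's HTML parser decodes entities such as &ouml; while parsing
root = fromstring(data)

# Print "tag-text" for each direct child of <article>
for element in root.iterchildren():
    print('%s-%s' % (element.tag, element.text_content()))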

Prints:

author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
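
If installing lxml is not an option, the same parser-based idea can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the class name TagTextCollector and the helper tags_to_lines are made up for this example, and it assumes flat elements that hold text directly, as in the question.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect 'tag-text' lines for every element that directly holds text."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &ouml;
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._tag = None    # most recently opened tag, if still open
        self._chunks = []   # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        self._tag = tag
        self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._chunks).strip()
            if text:
                self.lines.append("%s-%s" % (tag, text))
        self._tag = None


def tags_to_lines(markup):
    """Hypothetical helper: return the 'tag-text' lines found in markup."""
    parser = TagTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.lines


data = """
<article>
  <author>Johanna H&ouml;gberg</author>
  <year>2011</year>
</article>
"""

print("\n".join(tags_to_lines(data)))
```

The wrapping article element produces no line because the only text it holds directly is whitespace.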

Answer 2

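Please use alexce's approach if at all possible. (If it isn't, try to find a way to make it possible; see this answer for the rationale.) I'm just putting this here for variety.

Using re.match, named groups, and backreferences.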

import re

input_lines = '''<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>'''.splitlines()

out_lines = []
for line in input_lines:
    # (?P<tag>...) captures the tag name; (?P=tag) requires the same name in the closing tag
    mat = re.match(r'<(?P<tag>[^>]+)>([^>]*)</(?P=tag)>', line)
    if mat:
        out_lines.append("%s-%s" % mat.groups())

print('\n'.join(out_lines))

Output:

author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
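
Note that a regular expression leaves entities such as &ouml; untouched, which happens to match the output the question asks for. If decoded text is wanted instead, one possible sketch is to pass the captured text through html.unescape from the standard library. The function name convert and its decode_entities flag are made up for this illustration.

```python
import html
import re

# Same pattern as above: a named group for the tag name and a
# backreference that forces the closing tag to use the same name.
LINE_RE = re.compile(r'<(?P<tag>[^>]+)>([^>]*)</(?P=tag)>')


def convert(lines, decode_entities=False):
    """Hypothetical helper: turn '<tag>text</tag>' lines into 'tag-text'."""
    out = []
    for line in lines:
        mat = LINE_RE.match(line)
        if not mat:
            continue  # lines without a matching pair, e.g. </article>, are skipped
        tag, text = mat.groups()
        if decode_entities:
            text = html.unescape(text)  # &ouml; becomes ö
        out.append("%s-%s" % (tag, text))
    return out


sample = ['<author>Johanna H&ouml;gberg</author>', '</article>']
print('\n'.join(convert(sample)))
print('\n'.join(convert(sample, decode_entities=True)))
```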

Answer 3

You could try the re.sub commands below, but they won't handle nested tags.

>>> import re
>>> s = '''<author>Frank Drewes</author>
<author>Johanna H&ouml;gberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>'''
>>> m = re.sub(r'<(\w+)\b[^>]*>([^<]*)</\1>', r'\1-\2', s)
>>> print(re.sub(r'<[^<>]*>', '', m))
author-Frank Drewes
author-Johanna H&ouml;gberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
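
To make the nested-tag caveat concrete, here is a small sketch that wraps the two substitutions in a function and feeds it a nested element. The function name strip_tags and the first/last child tags are invented for the example; they do not appear in the question's data.

```python
import re


def strip_tags(s):
    # First pass: rewrite <tag>text</tag> as tag-text (text must contain no '<').
    m = re.sub(r'<(\w+)\b[^>]*>([^<]*)</\1>', r'\1-\2', s)
    # Second pass: drop whatever tags are left over.
    return re.sub(r'<[^<>]*>', '', m)


flat = '<year>2011</year>'
nested = '<author><first>Frank</first> <last>Drewes</last></author>'

print(strip_tags(flat))    # year-2011
print(strip_tags(nested))  # first-Frank last-Drewes
```

Only the innermost elements are rewritten; the outer author tag is removed by the second pass, so its name is lost from the output.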

Reformatting an XML file into a text file without tags
