I have a text file:
<author>Frank Drewes</author>
<author>Johanna Högberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>
I need to remove all of the angle brackets and instead output each tag name joined to its content with a hyphen:
author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
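Answer 0: (score: 2)
Instead of diving into the wonderful world of regular expressions, I would use a dedicated tool: a parser such as lxml.
Working example: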
from lxml.html import fromstring
data = u"""
<article>
<author>Frank Drewes</author>
<author>Johanna Högberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>
"""
root = fromstring(data)
for element in root.iterchildren():
    print('%s-%s' % (element.tag, element.text_content()))
Prints:
author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
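If lxml is not installed, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration and assumes the input is well-formed XML with a single root element, as in the lxml example above; the helper name tag_text_lines is made up for this sketch.

```python
import xml.etree.ElementTree as ET

data = """<article>
<author>Frank Drewes</author>
<author>Johanna Högberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>"""


def tag_text_lines(xml_text):
    """Return one 'tag-text' string per direct child of the root element.

    Hypothetical helper: it only looks at direct children and ignores
    any deeper nesting.
    """
    root = ET.fromstring(xml_text)
    # child.text is None for empty elements, so fall back to ''.
    return ['%s-%s' % (child.tag, (child.text or '').strip()) for child in root]


lines = tag_text_lines(data)
print('\n'.join(lines))
```

Unlike an HTML parser, ElementTree raises a ParseError on malformed input, so this variant is only suitable when the file really is valid XML.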
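Answer 1: (score: 1)
Please use alexce's approach if at all possible. (If it isn't, try to find a way to make it possible; see this answer for rationale.) I'm only throwing this in here for variety.
Using re.match, a named group, and a backreference: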
import re
input_lines = '''<author>Frank Drewes</author>
<author>Johanna Högberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>'''.splitlines()
out_lines = []
for line in input_lines:
    # (?P=tag) requires the closing tag to repeat the opening tag's name.
    mat = re.match(r'<(?P<tag>[^>]+)>([^<]*)</(?P=tag)>', line)
    if mat:
        out_lines.append("%s-%s" % mat.groups())
print('\n'.join(out_lines))
Output:
author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
Answer 2: (score: 0)
You can try the following re.sub calls, but note that they will not handle nested tags.
>>> import re
>>> s = '''<author>Frank Drewes</author>
<author>Johanna Högberg</author>
<author>Andreas Maletti</author>
<title>MAT learners for tree series: an abstract data type and two realizations.</title>
<pages>165-189</pages>
<year>2011</year>
<volume>48</volume>
</article>'''
>>> m = re.sub(r'<(\w+)\b[^>]*>([^<]*)</\1>', r'\1-\2', s)
>>> print(re.sub(r'<[^<>]*>', '', m))
author-Frank Drewes
author-Johanna Högberg
author-Andreas Maletti
title-MAT learners for tree series: an abstract data type and two realizations.
pages-165-189
year-2011
volume-48
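To make the nested-tag caveat concrete, here is a small sketch that applies the same two substitutions to a flat element and to a nested one. The nested input is invented for illustration and does not come from the original file.

```python
import re

# Same patterns as in the session above.
PAIR = r'<(\w+)\b[^>]*>([^<]*)</\1>'
LEFTOVER = r'<[^<>]*>'

flat = '<year>2011</year>'
# Hypothetical nested input, not part of the original data.
nested = '<author><first>Frank</first><last>Drewes</last></author>'

flat_result = re.sub(LEFTOVER, '', re.sub(PAIR, r'\1-\2', flat))

# Only the innermost elements match PAIR; the outer <author> pair does not,
# because its content contains '<'.
nested_step = re.sub(PAIR, r'\1-\2', nested)
# The second pass then strips the unmatched outer tags entirely.
nested_result = re.sub(LEFTOVER, '', nested_step)

print(flat_result)
print(nested_step)
print(nested_result)
```

With the nested input the outer tag name is lost and the inner results run together, which is why a parser is the safer choice whenever the data may contain nesting.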