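I have some well-behaved XML files that I want to reformat (NOT PARSE!) using regular expressions. The goal is to turn every <trkpt> pair into a one-liner.
The following code works, but I would like to do it in a single regex substitution instead of a loop, so that I don't need to join the strings back together.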
import re
xml = """
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
</trkseg>
"""
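for trkpt in re.findall(r'<trkpt.*?</trkpt>', xml, re.DOTALL):
    # \s already matches newlines, so no flag is needed here. A fourth
    # positional argument to re.sub would be read as count, not flags.
    print(re.sub(r'>\s*<', '><', trkpt))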
Answers using sed would also be welcome.
Thanks for reading.
Answer 0 (score: 2)
How about this:
>>> regex = re.compile(
    r"""\n[ \t]*          # Match a newline plus following whitespace
    (?=                   # only if...
     (?:                  # ...the following can be matched:
      (?!<trkpt)          # (unless an opening <trkpt> tag occurs first)
      .                   # any character
     )*                   # any number of times,
     </trkpt>             # followed by a closing </trkpt> tag
    )                     # End of lookahead""",
    re.DOTALL | re.VERBOSE)
>>> print(regex.sub("", xml))
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
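Compiling the pattern together with its flags, as above, also sidesteps a pitfall worth knowing when adapting regex snippets like these: the fourth positional parameter of re.sub is count, not flags. A flag passed in that position is silently read as a replacement limit (re.DOTALL has the integer value 16). The following is a small sketch of the effect, written for Python 3; the sample string and variable names are made up for illustration.

```python
import re

# A made-up document with 21 line breaks between tags,
# which is more than int(re.DOTALL) == 16.
sample = "<a>" + "\n<b/>" * 20 + "\n</a>"
pattern = r">\s+<"

# What re.sub(pattern, "><", sample, re.DOTALL) effectively does:
# the flag lands in the count slot and caps the number of replacements.
limited = re.sub(pattern, "><", sample, count=int(re.DOTALL))

# Passing the flag by keyword leaves count at its default (replace all).
full = re.sub(pattern, "><", sample, flags=re.DOTALL)

print(limited.count("\n"))  # 5 line breaks survive
print(full.count("\n"))     # 0
```

With only a handful of <trkpt> elements the limit is never reached, which is why such a call can appear to work on a small sample.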
Answer 1 (score: 1)
This isn't exactly what you asked for, but if the goal is a one-liner, here is one:
>>> print(re.sub(r'(<trkpt.*?</trkpt>)',
                 lambda m: re.sub(r'>\s*<', '><', m.group(1)),
                 xml, flags=re.DOTALL))
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
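Also note that this approach will break if any attribute value contains the string "<trkpt". That is unlikely to happen, but it is the problem with not using a real parser.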
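For comparison only, here is a minimal sketch of what the "real parser" route could look like with the standard library's xml.etree.ElementTree, written for Python 3. The helper name trkpt_one_liners is made up for illustration. It assumes the elements carry no XML namespace (as in the sample) and that whitespace-only text between child elements can be discarded. It returns the <trkpt> lines rather than rewriting the document, and attribute order and quoting are whatever ElementTree emits.

```python
import xml.etree.ElementTree as ET

sample = """
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
</trkseg>
"""

def trkpt_one_liners(document):
    """Return each <trkpt> element serialized on a single line."""
    root = ET.fromstring(document)
    lines = []
    for trkpt in root.iter("trkpt"):
        # Drop the whitespace stored before the first child, after each
        # child, and after the element itself (assumed to be layout only).
        trkpt.text = None
        trkpt.tail = None
        for child in trkpt:
            child.tail = None
        lines.append(ET.tostring(trkpt, encoding="unicode"))
    return lines

for line in trkpt_one_liners(sample):
    print(line)
```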
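Answer 2 (score: 1)
Do you want to keep <trkseg>? If so, this may work for you:
print(re.sub(r'([^gt])>\s*<', r'\g<1>><', xml))
It removes all whitespace between tags, provided that the character right before the preceding tag's closing > is not t or g. The line breaks after <trkseg> and </trkpt> are therefore kept.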
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
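Because the condition looks only at the last character of the preceding tag, any other tag ending in t or g is skipped as well. The sketch below (Python 3) illustrates this with a hypothetical <sat> child that is not part of the question's data; the value 5 is made up.

```python
import re

# Same layout as the question's sample, but with a made-up <sat> child
# whose closing tag ends in "t".
xml = """
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<sat>5</sat>
<ele>0</ele>
</trkpt>
</trkseg>
"""

result = re.sub(r'([^gt])>\s*<', r'\g<1>><', xml)
print(result)
# The line break after </sat> survives, so this <trkpt> is split
# across two lines instead of becoming a one-liner.
```

For the tags in the question (time and ele) this does not matter, but it is worth checking before reusing the pattern on files with additional children.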
Answer 3 (score: 1)
Another one-liner is
print(re.sub(r"(<trkpt.+?>).*?(<time>.+?</time>).*?(<ele>.+?</ele>).*?(</trkpt>)",
             r'\1\2\3\4', xml, flags=re.DOTALL))
which produces
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
<trkpt lon="-51.2220657617" lat="-30.1072524581"><time>2012-08-25T10:20:44Z</time><ele>0</ele></trkpt>
</trkseg>
The advantage of this approach is that it is easy to adapt to other tags.
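One way that adaptation could look is to build the pattern from a list of tag names. This is only a sketch for Python 3: the helper collapse and its parameters are hypothetical, and it uses [^>]* for the opening tag instead of .+? so that a parent without attributes is also handled. It assumes every parent element contains all listed children, in the given order; anything between the captured pieces is discarded, as in the one-liner above.

```python
import re

def collapse(xml, parent, children):
    """Put each <parent> element on one line, keeping only the listed children."""
    parts = [r"(<%s\b[^>]*>)" % re.escape(parent)]
    for child in children:
        name = re.escape(child)
        parts.append(r"(<%s>.+?</%s>)" % (name, name))
    parts.append(r"(</%s>)" % re.escape(parent))
    # Lazily skip whatever sits between the captured pieces.
    pattern = ".*?".join(parts)
    # Rebuild each match from its captured groups only.
    return re.sub(pattern, lambda m: "".join(m.groups()), xml, flags=re.DOTALL)

xml = """
<trkseg>
<trkpt lon="-51.2220657617" lat="-30.1072524581">
<time>2012-08-25T10:20:44Z</time>
<ele>0</ele>
</trkpt>
</trkseg>
"""

print(collapse(xml, "trkpt", ["time", "ele"]))
print(collapse(xml, "trkpt", ["ele"]))  # the unlisted <time> child is dropped
```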