Replace newlines and XML tags with ','

Time: 2010-12-28 20:57:50

Tags: python xml regex

I have an XML document that looks like this:

  <file>
    <name>NAME_OF_FILE</name>
  </file>
  <file>
    <name>NAME_OF_FILE</name>
  </file>

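I am trying to write a Python script that replaces all the newlines, tags, and whitespace between the element contents (i.e. everything that is not the element text itself) with ','.

The output for the file above should look like this: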

NAME_OF_FILE','NAME_OF_FILE','NAME_OF_FILE','

Here is what I have so far. I can't work out exactly how Python handles newlines:

import sys
import os
import re

source = r'c:\A\grepper.txt'

f = open(source,'r')
out = open(r'c:\A\bout.txt', 'a')

for line in f:
    one = re.sub(r"\n", '', line)
    two = re.sub(r"\r", '', one)
    three = re.sub(r'</name>.*<name>', '\',\'', two)
    out.write(three)

out.close()

4 answers:

Answer 0 (score: 2)

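Remove the r prefixes, because they make Python take the string literally (a raw string). Note that the re module also interprets the escape \n inside a raw pattern, so both forms end up matching a newline.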

one = re.sub("\n", '', line)
two = re.sub("\r", '', one)
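A quick check of the raw-string point, added as an illustration rather than as part of the original answer. It assumes a made-up sample line with a Windows line ending and compares the two pattern styles; the re module understands the two-character escape \n on its own, so both should strip the same characters.

```python
import re

sample = "NAME_OF_FILE\r\n"  # hypothetical line with a Windows line ending

# "\n" is a one-character string (an actual newline); r"\n" is two characters
# (backslash, n), which the regex engine then interprets as a newline.
assert len("\n") == 1 and len(r"\n") == 2

with_raw = re.sub(r"\r", "", re.sub(r"\n", "", sample))
without_raw = re.sub("\r", "", re.sub("\n", "", sample))

print(repr(with_raw))     # 'NAME_OF_FILE'
print(repr(without_raw))  # 'NAME_OF_FILE'
```

So the r prefix is unlikely to be what breaks the script in the question; it matters more for escapes such as \b, which mean different things to Python and to re.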

You can also use str.replace() for these simple substitutions and combine everything into one line.

line = re.sub(r'</name>.*<name>', "','", line.replace('\n', '').replace('\r', ''))
out.write(line)

However, this still does not produce the desired output. I suggest doing the following instead:

results = []
for line in f:
    match = re.search(r'<name>(.*)</name>', line)
    if match:
        results.append(match.group(1))
out.write("','".join(results) + "\n")

This works: http://ideone.com/ik48G
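For anyone who wants to try the extraction approach without touching the files on disk, here is a minimal self-contained sketch. The sample text, the extract_names helper, and the use of io.StringIO as a stand-in for the open file are all assumptions made for illustration. It also shows why the substitution attempt cannot work: the loop sees one line at a time, so a pattern that spans from </name> to the next <name> never has both tags on the same line.

```python
import io
import re

# Hypothetical stand-in for the input file from the question
xml_text = """\
  <file>
    <name>FIRST</name>
  </file>
  <file>
    <name>SECOND</name>
  </file>
"""

def extract_names(lines):
    """Collect the text of every <name>...</name> that sits on a single line."""
    results = []
    for line in lines:
        match = re.search(r"<name>(.*)</name>", line)
        if match:
            results.append(match.group(1))
    return results

names = extract_names(io.StringIO(xml_text))
joined = "','".join(names)
print(joined)  # FIRST','SECOND
```

This only handles a <name> element that opens and closes on the same line, which is the layout shown in the question.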

Answer 1 (score: 0)

Instead of substituting, you might want to consider matching what you want:

import re

# data is assumed to hold the full text of the XML file
tag_re = re.compile(r'''
    <(?P<tag>[a-z]+)> # First match the tag, must be a-z enclosed in <>
    (?P<value>[^<>]+) # Match the value, anything but <>
    </(?P=tag)> # Match the same tag we got earlier, but the closing version
''', re.VERBOSE)
print("','".join(m.group('value') for m in tag_re.finditer(data)))
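To see the matching approach run end to end, here is a small sketch with a made-up data string (the file names are placeholders, not from the question). Because the value group excludes angle brackets, the outer <file> elements never match and only the innermost <name> elements are returned.

```python
import re

# Hypothetical sample input; in the answer, `data` would be the whole file's text
data = """
<file>
  <name>a.txt</name>
</file>
<file>
  <name>b.txt</name>
</file>
"""

tag_re = re.compile(r"""
    <(?P<tag>[a-z]+)>      # opening tag, lowercase letters only
    (?P<value>[^<>]+)      # value: anything except angle brackets
    </(?P=tag)>            # closing tag with the same name
""", re.VERBOSE)

values = [m.group("value") for m in tag_re.finditer(data)]
print("','".join(values))  # a.txt','b.txt
```

One caveat: an element containing only whitespace would also match, so stripping and filtering the values may be needed on messier input.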

Answer 2 (score: 0)

Regular expressions are the wrong tool for this. Use the xml.sax.handler module.

Untested:

import xml.sax
from xml.sax.handler import ContentHandler

class CharactersOnlyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.text = ""
        self.texts = []

    def characters(self, content):
        self.text += content

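    def endElement(self, name):
        # Strip the indentation and newlines that surround the element text,
        # and always reset the buffer so whitespace does not leak into the next item
        text = self.text.strip()
        if text:
            self.texts.append(text)
        self.text = ""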

handler = CharactersOnlyContentHandler()
xml.sax.parse(xml_file_name, handler)
print(",".join("'%s'" % s for s in handler.texts))
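A runnable variant of the SAX idea, as a sketch under a few assumptions. The snippet in the question has no single root element, which an XML parser would reject, so the text is wrapped in a hypothetical <filelist> root first. The NameCollector class and the sample file names are illustrative, and only <name> elements are collected so that indentation between tags is ignored.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class NameCollector(ContentHandler):
    """Collect the stripped text of every <name> element."""
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.parts = []
        self.names = []

    def startElement(self, name, attrs):
        if name == "name":
            self.in_name = True
            self.parts = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate
        if self.in_name:
            self.parts.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self.parts).strip())
            self.in_name = False

fragment = "<file><name>a.txt</name></file>\n<file><name>b.txt</name></file>"
# Wrap in a hypothetical <filelist> root so the parser sees one document
document = "<filelist>%s</filelist>" % fragment

handler = NameCollector()
xml.sax.parseString(document.encode("utf-8"), handler)
print(",".join("'%s'" % s for s in handler.names))  # 'a.txt','b.txt'
```

Unlike the regex versions, this handles entities such as &amp; and elements split across lines, at the cost of requiring well-formed input.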

Answer 3 (score: 0)

import lxml.etree

myxml = """
<filelist>
    <file>
        <name>FIRST FILE NAME</name>
    </file>
    <file>
        <name>SECOND FILE NAME</name>
    </file>
</filelist>
"""

root = lxml.etree.fromstring(myxml)
filenames = root.xpath('//file/name/text()')
print(', '.join(filenames))

Result

FIRST FILE NAME, SECOND FILE NAME
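If lxml is not installed, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in alternative rather than the answer's own method. ElementTree supports only a subset of XPath, so the text() step is replaced by reading each element's .text, and the separator is changed to the ',' asked for in the question.

```python
import xml.etree.ElementTree as ET

myxml = """
<filelist>
    <file>
        <name>FIRST FILE NAME</name>
    </file>
    <file>
        <name>SECOND FILE NAME</name>
    </file>
</filelist>
"""

root = ET.fromstring(myxml)
# './/file/name' finds every <name> that is a direct child of a <file>
# anywhere below the root element
filenames = [el.text for el in root.findall(".//file/name")]
print("','".join(filenames))  # FIRST FILE NAME','SECOND FILE NAME
```

As with the lxml version, this assumes the <file> elements sit under a single root such as <filelist>.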