Question

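I have several files that I need to work on. They are XML files, but before the "<?xml version="1.0"?>" there are some debugging and status lines coming from the command line. Since I would like to parse the files, these lines must be removed. My question is: how can this be done? Preferably in place, i.e. the filename stays the same.

Thanks for any help.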

Answer 1

An inefficient solution would be to read the whole contents and find where the declaration occurs:

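fileName = "yourfile.xml"
with open(fileName, 'r+') as f:
    contents = f.read()
    # find() returns -1 when the declaration is missing and 0 when there is
    # nothing in front of it; in both cases the file is left untouched
    start = contents.find('<?xml version="1.0"?>')
    if start > 0:
        f.seek(0)
        f.write(contents[start:])
        f.truncate()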

The file will now contain the original file contents from "<?xml version="1.0"?>" onwards.
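
For several files, the same idea can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name strip_before_marker and the default marker b'<?xml' are assumptions chosen for illustration, and it works on bytes so that the file's encoding never has to be guessed.

```python
def strip_before_marker(path, marker=b'<?xml'):
    """Rewrite the file at `path` in place so that it starts at `marker`.

    Returns True if leading bytes were removed, False if the marker is
    missing or already sits at the very beginning of the file.
    """
    with open(path, 'r+b') as f:
        contents = f.read()
        start = contents.find(marker)
        if start <= 0:
            # -1: marker not found, 0: nothing to strip
            return False
        f.seek(0)
        f.write(contents[start:])
        f.truncate()  # cut off the leftover tail of the old contents
    return True
```

Calling it in a loop, for example for path in ['file1.xml', 'file2.xml']: strip_before_marker(path), keeps every filename unchanged. Like the snippet above, it still reads each file fully into memory.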

Answer 2

How about trimming the file header as you read the file?

import xml.etree.ElementTree as et

with open("input.xml", "rb") as inf:
    # find starting point
    offset = 0
    for line in inf:
        # the file is opened in binary mode, so compare against bytes
        if line.startswith(b'<?xml version="1.0"'):
            break
        else:
            offset += len(line)

    # read the xml file starting at that point
    inf.seek(offset)
    data = et.parse(inf)

This assumes that the XML header starts on its own line, but it works on my test file:

<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
    <def>xy</def>
    <def>hi</def>
</abc>
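
If the declaration might not start on its own line, one possible variation is to search the raw bytes instead of iterating over lines. The sketch below is an assumption-laden illustration rather than the answerer's method: parse_after_junk is a hypothetical helper name, and the sample bytes simply repeat the test file above.

```python
import xml.etree.ElementTree as ET


def parse_after_junk(raw, marker=b'<?xml'):
    """Parse the XML that begins at the first occurrence of `marker`."""
    start = raw.find(marker)
    if start == -1:
        raise ValueError('no XML declaration found')
    # fromstring() accepts bytes and returns the root element
    return ET.fromstring(raw[start:])


sample = (
    b'<!-- This is a line of junk -->\n'
    b'<!-- This is another -->\n'
    b'<?xml version="1.0" ?>\n'
    b'<abc>\n'
    b'    <def>xy</def>\n'
    b'    <def>hi</def>\n'
    b'</abc>\n'
)

root = parse_after_junk(sample)
print(root.tag, [d.text for d in root.findall('def')])  # abc ['xy', 'hi']
```

For a real file, the bytes would come from open(path, 'rb').read(); the file on disk is not modified.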

Answer 3

Since you say you have a number of files, using fileinput might be better than open. You can then do something like:

import fileinput
import sys

prolog = '<?xml version="1.0"?>'
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=True):
    # Every file has its own junk header, so reset the flag at each new file
    if fileinput.isfirstline():
        reached_prolog = False
    if line.startswith(prolog):
        reached_prolog = True
    # With inplace=True, stdout is redirected into the file being processed,
    # so only the lines written here are kept
    if reached_prolog:
        sys.stdout.write(line)

Reading the documentation for fileinput should make things clearer.

P.S. This is just a quick response; I have not run/tested the code.
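
Because in-place editing overwrites the originals, it may be worth keeping backups while trying this out. The following is a minimal sketch under stated assumptions: strip_headers is a hypothetical function name, the shortened prolog '<?xml' is an assumption, and backup is the fileinput parameter that keeps a copy of each original file with the given extension.

```python
import fileinput
import sys


def strip_headers(paths, prolog='<?xml', backup='.bak'):
    """Drop every line before the first one starting with `prolog`.

    Each file in `paths` is edited in place; the original is kept next to
    it with the `backup` extension. A file without any prolog line ends up
    empty, so the backup is the only copy of its contents.
    """
    with fileinput.input(files=paths, inplace=True, backup=backup) as stream:
        reached_prolog = False
        for line in stream:
            if stream.isfirstline():
                reached_prolog = False  # a new file starts here
            if line.startswith(prolog):
                reached_prolog = True
            if reached_prolog:
                # stdout points at the current file while inplace is active
                sys.stdout.write(line)
```

A call such as strip_headers(['file1.xml', 'file2.xml']) would leave file1.xml.bak and file2.xml.bak behind, which can be deleted once the results look right.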

Answer 4

A solution using a regexp:

import re
import shutil

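with open('myxml.xml') as ifile, open('tempfile.tmp', 'w') as ofile:
    for line in ifile:
        # Match from the XML declaration to the end of the line; DOTALL lets
        # '.' cover the trailing newline, so a bare declaration line matches too
        match = re.search(r'<\?xml version="1\.0"\?>.*', line, re.DOTALL)
        if match:
            ofile.write(match.group(0))
            # copy the rest of the input unchanged
            ofile.writelines(ifile)
            break
# Replace the original only after both files have been closed
shutil.move('tempfile.tmp', 'myxml.xml')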
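
Real declarations vary (single quotes, an encoding attribute, a space before the closing "?>"), so a looser pattern may be more forgiving. This is a rough sketch, not a full XML grammar: the pattern XML_DECL and the helper drop_before_declaration are hypothetical names, and the function works on text that has already been read into a string.

```python
import re

# "<?xml", a version attribute in either quote style, optional further
# attributes such as encoding, then the closing "?>"
XML_DECL = re.compile(r'<\?xml\s+version\s*=\s*["\']1\.[0-9]+["\'][^?]*\?>')


def drop_before_declaration(text):
    """Return `text` starting at the XML declaration, or unchanged if none."""
    match = XML_DECL.search(text)
    return text[match.start():] if match else text
```

For example, drop_before_declaration('status: ok\n<?xml version="1.0"?>\n<a/>\n') returns the string starting at "<?xml".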

Python: remove everything before certain characters
