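How do I add a UTF-8 BOM to a text file without open()?
In theory, we only need to add the UTF-8 BOM at the start of the file, so we shouldn't need to read the whole content?
Answer 0 (score: 3)
You need to read the data, because you need to shift all of it to make room for the BOM. Files can't just have arbitrary data prepended. Doing this in place is harder than writing a new file that contains the BOM followed by the original data and then replacing the original file, so the simplest solution is usually something like: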

    import os
    import shutil
    from os.path import dirname, realpath
    from tempfile import NamedTemporaryFile

    infile = ...

    # Open original file as UTF-8 and tempfile in same directory to add sig
    indir = dirname(realpath(infile))
    # delete=False: the temp file is either moved over the original or removed
    # by hand below (setting tf.delete afterwards is not honored by every
    # Python version)
    with NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig', delete=False) as tf:
        try:
            with open(infile, encoding='utf-8') as f:
                # Copy from one file to the other by blocks
                # (avoids memory use of slurping whole file at once)
                shutil.copyfileobj(f, tf)

            # Optional: Replicate metadata of original file
            tf.flush()
            shutil.copystat(f.name, tf.name)  # Replicate permissions of original file

            # Atomically replace original file with BOM marked file
            os.replace(tf.name, f.name)
        except BaseException:
            # Something went wrong: don't leave the temp file behind
            tf.close()
            os.remove(tf.name)
            raise

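As a side effect, this also verifies that the input file really is UTF-8, and the original file never exists in an inconsistent state; it is either the old data or the new data, never an intermediate working copy.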
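
To confirm the result, or to skip files that already carry a BOM, it can help to look at the first three bytes. The following is a minimal sketch; `has_utf8_bom` is a hypothetical helper name, not something from the answer above.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM."""
    with open(path, 'rb') as f:
        # Read only as many bytes as the BOM is long (3 for UTF-8)
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling `has_utf8_bom(infile)` before the rewrite would let you leave already-marked files alone, since reading with `encoding='utf-8'` and writing with `'utf-8-sig'` would otherwise add a second BOM.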
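If your file is large and disk space is limited (so you can't keep two copies on disk at once), mutating the file in place may be acceptable. The simplest way to do that is the mmap module, which makes moving large amounts of data much simpler than doing it with in-place file object operations: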

    import codecs
    import mmap

    # Open file for read and write and then immediately map the whole file for write
    with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        origsize = mm.size()
        bomlen = len(codecs.BOM_UTF8)

        # Allocate additional space for BOM
        mm.resize(origsize+bomlen)

        # Copy file contents down to make room for BOM
        # This reads and writes the whole file, and is unavoidable
        mm.move(bomlen, 0, origsize)

        # Insert the BOM before the shifted data
        mm[:bomlen] = codecs.BOM_UTF8

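
The shift-then-insert step can be tried without touching a real file. The sketch below uses an anonymous memory map (not backed by any file) that is already sized to hold the BOM plus the payload, so no `resize` call is needed; the sample bytes `b'hello'` are purely illustrative.

```python
import codecs
import mmap

payload = b'hello'  # illustrative stand-in for the original file contents
bomlen = len(codecs.BOM_UTF8)

# Anonymous map (fileno -1) with room for the BOM plus the payload
with mmap.mmap(-1, len(payload) + bomlen) as mm:
    # Start with the payload at offset 0, as it would sit in the file
    mm[:len(payload)] = payload

    # Shift the payload down by bomlen bytes; move() handles the overlap
    mm.move(bomlen, 0, len(payload))

    # Write the BOM into the space that was freed at the front
    mm[:bomlen] = codecs.BOM_UTF8

    shifted = mm[:]  # copy the result out as bytes before the map closes
```

Here `shifted` ends up as the BOM followed by the payload. With a real file the same three steps apply, but the whole file is still read and rewritten once.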
Answer 1 (score: 1)
If you need an in-place update, something like

    import codecs
    import os
    import resource

    BOM = codecs.BOM_UTF8

    def add_bom(fname, bom=None, buf_size=None):
        bom = bom or BOM
        # a block must be at least as long as the BOM, see below
        buf_size = max(buf_size or resource.getpagesize(), len(bom))
        with open(fname, 'rb') as in_fd, open(fname, 'rb+') as out_fd:
            # we cannot just read until eof, because we
            # will be writing to that very same file, extending it.
            nbytes = os.fstat(in_fd.fileno()).st_size
            pending = in_fd.read(min(buf_size, nbytes))
            if pending.startswith(bom):
                # don't write the BOM if it's already there
                return
            nbytes -= len(pending)
            out_fd.write(bom)
            while pending:
                # read the next block *before* writing the current one:
                # the write lands len(bom) bytes further along and would
                # otherwise overwrite the start of a block not yet read.
                following = in_fd.read(min(buf_size, nbytes))
                nbytes -= len(following)
                out_fd.write(pending)
                pending = following

should do it.
As you can see, updating in place is a bit clumsy, because you are reading from and writing to the very same file.
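Also, this can leave the file in an inconsistent state. Copying to a temporary file and then moving the temporary file over the original is the preferred approach where possible.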