Question

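How can I add a UTF-8 BOM to a text file without calling open() on it?

In theory, we only need to add the BOM at the beginning of the file, so we shouldn't need to read in 'all' of the content, right?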

Answer 1

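You need to read the data, because all of it has to be moved to make room for the BOM. Files don't support prepending data in place. Doing the shift in place is harder than writing a new file with the BOM followed by the original data and then replacing the original file, so the easiest solution is usually something like: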

import os
import shutil

from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Open original file as UTF-8 and tempfile in same directory to add sig
indir = dirname(realpath(infile))
# delete=False: once os.replace() succeeds the temp name no longer exists,
# so automatic deletion on close would fail; clean up manually on error.
# newline='' on both files keeps the original line endings untouched.
with NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig',
                        newline='', delete=False) as tf:
    try:
        with open(infile, encoding='utf-8', newline='') as f:
            # Copy from one file to the other by blocks
            # (avoids memory use of slurping whole file at once)
            shutil.copyfileobj(f, tf)

        # Optional: Replicate metadata of original file
        tf.flush()
        shutil.copystat(f.name, tf.name)  # Replicate permissions of original file

        # Atomically replace original file with BOM marked file
        os.replace(tf.name, f.name)
    except BaseException:
        # Something failed before the replace; don't leave the temp file behind
        os.remove(tf.name)
        raise

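As a side effect, this also verifies that the input file really is UTF-8, and the original file never exists in an inconsistent state; it holds either the old data or the new data, never an intermediate working copy.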
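To make the role of the two codecs concrete, here is a minimal sketch of what `utf-8-sig` and `utf-8` do with a BOM and with invalid input. The sample string and the variable names are made up for illustration.

```python
import codecs

text = "héllo"  # hypothetical sample content

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# The utf-8-sig encoder prepends the 3-byte BOM (EF BB BF) exactly once.
assert with_bom == codecs.BOM_UTF8 + plain

# Decoding with utf-8-sig strips the BOM; plain utf-8 keeps it as U+FEFF.
assert with_bom.decode("utf-8-sig") == text
assert with_bom.decode("utf-8") == "\ufeff" + text

# Bytes that are not valid UTF-8 are rejected by the strict utf-8 decoder,
# which is why reading the input as 'utf-8' doubles as a validity check.
try:
    b"\xff\xfe".decode("utf-8")
except UnicodeDecodeError:
    invalid_rejected = True
else:
    invalid_rejected = False
```

Because decoding is strict by default, a file in some other encoding typically makes the copy raise before the original is replaced.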

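If your files are large and disk space is limited (so you can't hold two copies on disk at once), in-place mutation may be acceptable. The easiest way to do that is with the mmap module, which makes moving the data around much simpler than in-place file object operations: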

import codecs
import mmap

# Open file for read and write and then immediately map the whole file for write
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    # Allocate additional space for BOM
    mm.resize(origsize+bomlen)

    # Copy file contents down to make room for BOM
    # This reads and writes the whole file, and is unavoidable
    mm.move(bomlen, 0, origsize)

    # Insert the BOM before the shifted data
    mm[:bomlen] = codecs.BOM_UTF8
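Neither snippet above checks whether the file already starts with a BOM, so running one of them twice would add a second BOM. Detecting an existing BOM, unlike adding one, needs only the first three bytes. A minimal sketch; `has_utf8_bom` and `demo` are hypothetical helper names, not part of the answer:

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM."""
    with open(path, "rb") as f:
        # Only the first three bytes are read, however large the file is.
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def demo():
    """Create two throwaway files and report whether each starts with a BOM."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "plain.txt")
        marked = os.path.join(tmp, "marked.txt")
        with open(plain, "wb") as f:
            f.write(b"hello\n")
        with open(marked, "wb") as f:
            f.write(codecs.BOM_UTF8 + b"hello\n")
        return has_utf8_bom(plain), has_utf8_bom(marked)
```

Calling `has_utf8_bom(infile)` before either approach and skipping the rewrite when it returns True would make the operation safe to repeat.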

Answer 2

If you need an in-place update, something like

import codecs
import mmap
import os

BOM = codecs.BOM_UTF8

def add_bom(fname, bom=None, buf_size=None):
    bom = bom or BOM
    # The buffer must be at least as long as the BOM, so that the write
    # position never overtakes the read position.
    buf_size = max(buf_size or mmap.PAGESIZE, len(bom))
    with open(fname, 'rb') as in_fd, open(fname, 'rb+') as out_fd:
        # we cannot just read until eof, because we
        # will be writing to that very same file, extending it.
        unread = os.fstat(in_fd.fileno()).st_size
        pending = in_fd.read(min(buf_size, unread))
        unread -= len(pending)
        if pending.startswith(bom):
            # don't write the BOM if it's already there
            return
        out_fd.write(bom)
        while pending:
            # read the next block *before* writing the pending one, so the
            # write never clobbers data that has not been read yet
            nxt = in_fd.read(min(buf_size, unread))
            unread -= len(nxt)
            out_fd.write(pending)
            pending = nxt

should do the trick.

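As you can see, in-place updating is a little finicky, because you are

1. writing data to the place you have just read from. The read must always stay ahead of the write, otherwise you overwrite data that has not been processed yet.
2. extending the file you are reading, so reading until EOF does not work.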
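To see the first point in isolation, here is a minimal in-memory sketch. A bytearray stands in for the file, and `prepend_in_place` with its `read_ahead` flag is a hypothetical helper written only for this illustration; it assumes the block size is at least as long as the prefix.

```python
def prepend_in_place(storage: bytearray, prefix: bytes, block: int,
                     read_ahead: bool) -> bytes:
    """Prepend `prefix` to `storage` block by block, reusing the same memory.

    `storage` stands in for the file; `block` must be >= len(prefix).
    """
    original_size = len(storage)          # remember where the old data ends
    storage.extend(bytes(len(prefix)))    # grow the "file" to its final size
    rpos = wpos = 0
    pending = prefix
    while pending:
        end = min(rpos + block, original_size)  # never read past the old EOF
        if read_ahead:
            nxt = bytes(storage[rpos:end])               # read first ...
            storage[wpos:wpos + len(pending)] = pending  # ... then write
        else:
            storage[wpos:wpos + len(pending)] = pending  # write first ...
            nxt = bytes(storage[rpos:end])               # ... read clobbered bytes
        wpos += len(pending)
        rpos += len(nxt)
        pending = nxt
    return bytes(storage)


good = prepend_in_place(bytearray(b"0123456789"), b"BOM", 4, read_ahead=True)
bad = prepend_in_place(bytearray(b"0123456789"), b"BOM", 4, read_ahead=False)
```

With `read_ahead=True` the result is `b"BOM0123456789"`. With `read_ahead=False` the first write already destroys bytes that have not been read, so the output is corrupted.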

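Also, this can leave the file in an inconsistent state. Copying to a temporary file and then moving the temporary file over the original is the preferred method where possible.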

Python - Can I add a UTF-8 BOM to a file without opening it?
