Python - Can I add a UTF-8 BOM to a file without opening it?

Date: 2016-08-23 06:46:29

Tags: python unicode utf-8 byte-order-mark

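How can I add a UTF-8 BOM to a text file without open()-ing it?

In theory, we only need to add the UTF-8 BOM at the beginning of the file, so we shouldn't need to read all of its contents?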

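2 answers:

Answer 0: (score: 3)

You need to read the data, because all of it has to be shifted to make room for the BOM. Files cannot simply have arbitrary data prepended to them. Doing the shift in place is harder than writing a new file consisting of the BOM followed by the original data and then replacing the original file, so the simplest solution is usually something like: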

import os
import shutil

from os.path import dirname, realpath
from tempfile import NamedTemporaryFile

infile = ...

# Create the temp file in the same directory as the original so the final
# rename stays on one filesystem. delete=False because we rename it ourselves.
indir = dirname(realpath(infile))
tf = NamedTemporaryFile(dir=indir, mode='w', encoding='utf-8-sig',
                        newline='', delete=False)
try:
    # newline='' on both files keeps the original line endings untouched
    with tf, open(infile, encoding='utf-8', newline='') as f:
        # Copy from one file to the other by blocks
        # (avoids memory use of slurping whole file at once)
        shutil.copyfileobj(f, tf)

    # Optional: Replicate metadata (permissions, timestamps) of original file
    shutil.copystat(infile, tf.name)

    # Atomically replace original file with BOM marked file
    os.replace(tf.name, infile)
except BaseException:
    # Something failed before the replace completed: drop the temp file
    os.remove(tf.name)
    raise

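As a side effect, this also verifies that the input file really is UTF-8, and the original file never exists in an inconsistent state; it holds either the old data or the new data, never an intermediate working copy.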
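Reading the input as plain 'utf-8' does not strip a BOM that is already present, so it can be worth checking for one first. A minimal sketch of such a check, where `has_utf8_bom` is a hypothetical helper name and not part of the answer:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM bytes."""
    with open(path, 'rb') as f:
        # Only the first len(BOM) bytes are read; the rest of the file is untouched
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling it before the rewrite (for example `if not has_utf8_bom(infile): ...`) lets you skip files that are already marked.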

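If your files are large and disk space is limited (so you cannot keep two copies on disk at once), in-place mutation may be acceptable. The easiest way to do it is with the mmap module, which makes moving the data around considerably simpler than in-place file object operations: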

import codecs
import mmap

# Open file for read and write and then immediately map the whole file for write
with open(infile, 'r+b') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    origsize = mm.size()
    bomlen = len(codecs.BOM_UTF8)
    # Allocate additional space for BOM
    # (mmap.resize() is not supported on every platform)
    mm.resize(origsize + bomlen)

    # Copy file contents down to make room for BOM
    # This reads and writes the whole file, and is unavoidable
    mm.move(bomlen, 0, origsize)

    # Insert the BOM before the shifted data
    mm[:bomlen] = codecs.BOM_UTF8

Answer 1: (score: 1)

If you need an in-place update, something like

import codecs
import resource  # Unix-only; used for the default buffer size

BOM = codecs.BOM_UTF8

def add_bom(fname, bom=None, buf_size=None):
    bom = bom or BOM
    # the buffer must be at least as long as the BOM, so reads stay ahead of writes
    buf_size = max(buf_size or resource.getpagesize(), len(bom))
    with open(fname, 'rb', 0) as in_fd, open(fname, 'rb+', 0) as out_fd:
        # we cannot just read until eof, because we
        # will be writing to that very same file, extending it.
        out_fd.seek(0, 2)
        nbytes = out_fd.tell()
        out_fd.seek(0)
        pending = in_fd.read(min(buf_size, nbytes))
        nbytes -= len(pending)
        if pending.startswith(bom):
            # don't write the BOM if it's already there
            return
        out_fd.write(bom)
        while pending:
            # read the next block *before* writing the previous one, and
            # only read as much data as the original file contained
            chunk = in_fd.read(min(buf_size, nbytes))
            nbytes -= len(chunk)
            out_fd.write(pending)
            pending = chunk

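should do it.

As you can see, in-place updating is a bit awkward, because you

  1. Write data to the place you have just read from. Reads must always stay ahead of writes, otherwise you overwrite data that has not been processed yet.
  2. Extend the file you are reading, so reading until EOF does not work.
  3. Also, this may leave the file in an inconsistent state. Copying to a temp file and then moving the temp file over the original is preferred if possible.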
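One way to sidestep the bookkeeping in points 1 and 2 is to shift the data starting from the end of the file, so each block is copied into space that has already been vacated. This is a different technique from the code above, shown only as a rough sketch; `prepend_in_place` and `block_size` are hypothetical names, and point 3 still applies if the process is interrupted midway.

```python
import codecs
import os


def prepend_in_place(path, prefix=codecs.BOM_UTF8, block_size=64 * 1024):
    """Shift the file's contents up by len(prefix), last block first."""
    with open(path, 'r+b') as f:
        pos = f.seek(0, os.SEEK_END)  # original size; nothing past it is ever read
        while pos > 0:
            start = max(0, pos - block_size)
            f.seek(start)
            block = f.read(pos - start)
            # The target range only overlaps bytes that were already moved
            f.seek(start + len(prefix))
            f.write(block)
            pos = start
        # The first len(prefix) bytes are now free for the prefix
        f.seek(0)
        f.write(prefix)
```

Because the original size is captured once and the loop walks backwards, the growing file never confuses the reads, at the cost of seeking before every read and write.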