Question

I want to read a file containing Unicode data, do some processing, and write the result to another file.

# -*- coding: utf-8 -*-
import sys
import codecs


def main(big_filename, small_filename):

    print "Big file ", big_filename
    print "Small file ", small_filename
    pattern1 = u'CreationDate="2008'
    pattern2 = u'CreationDate="2014'

    small_f = codecs.open(small_filename, 'w', encoding='utf-8')
    small_f.write('<?xml version="1.0" encoding="utf-8"?>\n')
    small_f.write("<posts>\n")

    cnt = 0
    big_f = codecs.open(big_filename, 'r', encoding='utf-8')
    for line in big_f:
        #line = line.decode('utf-8')
        if line.find(pattern1) != -1 or line.find(pattern2) != -1:
            small_f.write(line)
            cnt = cnt + 1
            # Report progress once every 1000 records.
            if cnt % 1000 == 0:
                print cnt, " records written"

    big_f.close()
    small_f.write("</posts>\n")
    small_f.close()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

Text in big_filename

Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz.

Text in small_filename

Other hardware architectures fall back to the systemâ€™s timer, which is typically set to 100 Hz.

But I see that some of the Unicode text has been modified in small_filename. Can anyone tell me how to fix this?

Answer 1

Strings have a built-in encode method. Try this:

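Read the small file and dump its contents into a variable, like this: small_read = open(small_filename).read()
Overwrite the file with a new, empty file: small_file = open(small_filename, 'w')
Write the header with the same codec, exactly as you did: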

small_f = codecs.open(small_filename, 'w', encoding='utf-8')
small_f.write('<?xml version="1.0" encoding="utf-8"?>\n')
small_f.write("<posts>\n")

Now write the file contents encoded as utf-8, like this:

small_file.write(small_read.encode('utf-8'))
small_file.close()
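
One caveat with the steps above: in Python 2, open(small_filename).read() returns a byte string, and calling .encode('utf-8') on it first decodes it implicitly as ASCII, which raises UnicodeDecodeError as soon as the data contains a non-ASCII byte. The script in the question already reads and writes UTF-8 through codecs.open, so a plausible explanation (an assumption, not something the question confirms) is that the output file is correct and is only being displayed by a tool that assumes Windows-1252. The sketch below reproduces that effect and shows a round trip with an explicit encoding on both sides. The sample string and the temporary file path are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import tempfile

# Sample text containing U+2019 RIGHT SINGLE QUOTATION MARK.
original = 'the system\u2019s timer'

# U+2019 is encoded in UTF-8 as the bytes E2 80 99. A viewer that
# assumes cp1252 shows those three bytes as three separate characters.
utf8_bytes = original.encode('utf-8')
misread = utf8_bytes.decode('cp1252')
print(repr(misread))

# Undoing the wrong decode recovers the original text, which suggests
# the bytes on disk were never damaged.
repaired = misread.encode('cp1252').decode('utf-8')

# Round trip through a file, naming the encoding when writing and reading.
# The path is a hypothetical temporary file, not one from the question.
path = os.path.join(tempfile.mkdtemp(), 'small.xml')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(original + '\n')
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read().rstrip('\n')
```

If the round-tripped text matches the original, the remaining step is to open the output file in an editor or terminal that is set to UTF-8.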
