Unicode text gets modified in Python

Time: 2015-01-14 14:55:33

Tags: python unicode

I want to read a file containing Unicode data, do some processing, and write the result to another file.

# -*- coding: utf-8 -*-
import sys
import codecs


def main(big_filename, small_filename):

    print "Big file ", big_filename
    print "Small file ", small_filename
    pattern1 = u'CreationDate="2008'
    pattern2 = u'CreationDate="2014'

    small_f = codecs.open(small_filename, 'w', encoding='utf-8')
    # Write unicode literals so the codec writer never has to guess an encoding
    small_f.write(u'<?xml version="1.0" encoding="utf-8"?>\n')
    small_f.write(u"<posts>\n")

    cnt = 0
    big_f = codecs.open(big_filename, 'r', encoding='utf-8')
    for line in big_f:
        #line = line.decode('utf-8')
        if line.find(pattern1) != -1 or line.find(pattern2) != -1:
            small_f.write(line)
            cnt = cnt + 1
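            if cnt % 1000 == 0:  # report progress once every 1000 records
                print cnt, " records written"

    small_f.write(u"</posts>\n")  # closing tag needs a forward slash
    small_f.close()
    big_f.close()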

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

Text in big_filename:

Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz.

Text in small_filename:

Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz.

But I see that some of the Unicode text has been modified in small_filename. Can anyone tell me how to fix this?
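One way to narrow this down is to check whether the bytes on disk actually changed, or whether the output is correct UTF-8 that is being displayed with a different codec. The following is only a minimal sketch: it assumes Python 3 syntax, and it assumes Windows-1252 (`cp1252`) as the wrongly applied codec, which the post does not state. The helper name `misread_as` is hypothetical.

```python
def misread_as(text, wrong_codec="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec.

    This imitates a viewer or editor that opens a UTF-8 file while assuming
    a different encoding. The codec name is an assumption for illustration.
    """
    return text.encode("utf-8").decode(wrong_codec)


original = u"system\u2019s timer"  # U+2019 RIGHT SINGLE QUOTATION MARK

# The UTF-8 bytes for U+2019 are E2 80 99; read as cp1252 they become
# three separate characters: U+00E2, U+20AC, U+2122.
garbled = misread_as(original)
print(repr(garbled))

# Decoding the same bytes as UTF-8 gives back the original text unchanged.
assert original.encode("utf-8").decode("utf-8") == original
```

If the garbled form printed here matches what shows up in small_filename, the file itself is probably fine and the tool used to view it is the thing to reconfigure.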

1 answer:

Answer 0: (score: 0)

Strings inherently have an encode method. Try this:

  1. Read the small file and store its contents in a variable, like this: small_read = open(small_filename).read()
  2. Overwrite the file with a new blank file: small_file = open(small_filename, 'w')
  3. Write the header using the codec, exactly as you already do:

    small_f = codecs.open(small_filename, 'w', encoding='utf-8')
    small_f.write('<?xml version="1.0" encoding="utf-8"?>\n')
    small_f.write("<posts>\n")

  4. Now write the file contents encoded as utf-8, like this:

    small_file.write(small_read.encode('utf-8'))
    small_file.close()
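As an alternative to re-reading and re-encoding the output, the whole read, filter, and write flow can be done in a single pass with both files opened as UTF-8 text. This is a sketch under assumptions, not the method from the question or the answer above: it uses `io.open` (available in Python 2.6+ and Python 3), and the function name `filter_posts` and its `patterns` parameter are hypothetical.

```python
import io


def filter_posts(big_filename, small_filename, patterns):
    """Copy lines containing any of the patterns into a new <posts> file.

    Both files are opened as UTF-8 text, so each line is decoded once on
    read and encoded once on write. Returns the number of lines copied.
    """
    count = 0
    with io.open(big_filename, "r", encoding="utf-8") as big_f, \
            io.open(small_filename, "w", encoding="utf-8") as small_f:
        small_f.write(u'<?xml version="1.0" encoding="utf-8"?>\n')
        small_f.write(u"<posts>\n")
        for line in big_f:
            # Keep the line if any pattern occurs anywhere in it
            if any(pattern in line for pattern in patterns):
                small_f.write(line)
                count += 1
        small_f.write(u"</posts>\n")
    return count
```

It could be called as `filter_posts(sys.argv[1], sys.argv[2], [u'CreationDate="2008', u'CreationDate="2014'])`. The `with` block closes both files even if an error occurs partway through, which avoids leaving a truncated output file open.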