I want to read a file that contains Unicode data, do some processing, and write the result to another file.
# -*- coding: utf-8 -*-
import sys
import codecs

def main(big_filename, small_filename):
    print "Big file ", big_filename
    print "Small file ", small_filename
    pattern1 = u'CreationDate="2008'
    pattern2 = u'CreationDate="2014'
    small_f = codecs.open(small_filename, 'w', encoding='utf-8')
    small_f.write('<?xml version="1.0" encoding="utf-8"?>\n')
    small_f.write("<posts>\n")
    cnt = 0
    big_f = codecs.open(big_filename, 'r', encoding='utf-8')
    for line in big_f:
        #line = line.decode('utf-8')
        if line.find(pattern1) != -1 or line.find(pattern2) != -1:
            small_f.write(line)
            cnt = cnt + 1
            # report progress once every 1000 records
            if cnt % 1000 == 0:
                print cnt, " records written"
    big_f.close()
    small_f.write("</posts>\n")
    small_f.close()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
Text in big_filename:
Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz.
Text in small_filename:
Other hardware architectures fall back to the system’s timer, which is typically set to 100 Hz.
However, I see that some of the Unicode text in small_filename has been modified. Can anyone tell me how to fix this?
Answer 0 (score: 0)
Strings have a built-in encode method. Try this:
small_read = open(small_filename).read()
small_file = open(small_filename, 'w')
Write the header through the codec exactly as you already do:
small_f = codecs.open(small_filename, 'w', encoding='utf-8')
small_f.write('<?xml version="1.0" encoding="utf-8"?>\n')
small_f.write("<posts>\n")
Now write the file contents encoded as UTF-8, like this:
small_file.write(small_read.encode('utf-8'))
small_file.close()
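For comparison, here is a minimal sketch of the same filtering step that keeps everything as Unicode text from read to write, so no manual encode or decode call is needed. It assumes both files are UTF-8. The function name filter_posts and its patterns parameter are made up for illustration and do not appear in the question. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
import io


def filter_posts(big_filename, small_filename, patterns):
    """Copy every line that contains one of `patterns` into a small XML file.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    count = 0
    # io.open decodes on read and encodes on write, so the loop below
    # only ever handles Unicode text, never raw bytes.
    with io.open(big_filename, 'r', encoding='utf-8') as big_f, \
            io.open(small_filename, 'w', encoding='utf-8') as small_f:
        small_f.write(u'<?xml version="1.0" encoding="utf-8"?>\n')
        small_f.write(u'<posts>\n')
        for line in big_f:
            if any(p in line for p in patterns):
                small_f.write(line)
                count += 1
        small_f.write(u'</posts>\n')
    return count
```

It could be called as filter_posts(sys.argv[1], sys.argv[2], [u'CreationDate="2008', u'CreationDate="2014']). If the output still looks altered after a round trip like this, one possibility (not confirmed by the question) is that the file itself is valid UTF-8 and the program used to view it is decoding it with a different encoding.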