Python string equality test gives inconsistent results

Time: 2011-11-19 22:14:23

Tags: python

I use the following function in a script to create a static version of a Django site:

import io
import os

def write_file(filename, content):
    filename = '{0}{1}.html'.format(BASEDIR, filename)
    if os.path.exists(filename):
        existing_file = io.open(filename, encoding='utf-8')
        existing_content = existing_file.read()
        existing_file.close()
        if existing_content != content:
            print "Content is not equal, writing file to {0}".format(filename)
            encoded_content = content.encode('utf-8')
            html_file = open(filename, 'w')
            html_file.write(encoded_content)
            html_file.close()
        else:
            print "Content is equal, nothing is written to {0}".format(filename)

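When I run the script twice (without making any changes to the database), I would expect no writes at all on the second run. Strangely, more than half of the files are written over and over again.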

2 Answers:

Answer 0: (score: 0)

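What you describe is a symptom of either data being encoded twice somewhere in the process, or encoded text being compared with unicode. In Python 2.x, 'abc' == u'abc', so files that contain only ASCII will pass the comparison test. The other half of the files contain non-ASCII characters, which are not the same before and after UTF-8 encoding.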
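The ASCII versus non-ASCII split described above can be reproduced in a small sketch. It is written in Python 3 compatible syntax, where text and bytes never compare equal implicitly, so it imitates the mismatch instead: text is encoded as UTF-8 and then decoded with the wrong codec. The helper name `round_trip_mismatch` and the choice of Latin-1 as the wrong codec are assumptions made for illustration, not something taken from the question.

```python
def round_trip_mismatch(text):
    """Encode as UTF-8 but decode as Latin-1, imitating a mismatched step."""
    return text.encode("utf-8").decode("latin-1")


ascii_text = u"plain ascii"
accented_text = u"caf\u00e9"

# ASCII characters have the same single-byte form in both codecs,
# so the text comes back unchanged and the comparison passes.
ascii_survives = round_trip_mismatch(ascii_text) == ascii_text

# The accented character becomes two bytes in UTF-8, which Latin-1
# decodes as two separate characters, so the comparison fails.
accented_survives = round_trip_mismatch(accented_text) == accented_text

print(ascii_survives)
print(accented_survives)
```

Files whose content is pure ASCII behave like `ascii_text` and look unchanged, while any file with an accented character behaves like `accented_text` and is rewritten on every run.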

The easiest way to tell what is going on is to improve the error reporting in your code: after the else clause, add:

print repr(existing_content), repr(content)
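For long pages the two repr() dumps can be hard to compare by eye. A slightly more targeted report might look like the sketch below; `describe_difference` is a hypothetical helper, not part of the original answer, and it only reports the types involved and the first position where the values diverge.

```python
def describe_difference(existing_content, content):
    """Return a short report on how two values differ (hypothetical helper)."""
    report = ["types: {0} vs {1}".format(
        type(existing_content).__name__, type(content).__name__)]

    # Mixed text/bytes comparisons are the first thing to rule out.
    if type(existing_content) is not type(content):
        report.append("different types, so text is being compared with bytes")
        return "\n".join(report)

    # Walk both values in parallel and stop at the first mismatch.
    for index, (old, new) in enumerate(zip(existing_content, content)):
        if old != new:
            start = max(index - 10, 0)
            report.append("first difference at index {0}".format(index))
            report.append("existing: {0!r}".format(
                existing_content[start:index + 10]))
            report.append("new:      {0!r}".format(content[start:index + 10]))
            return "\n".join(report)

    if len(existing_content) != len(content):
        report.append("one value is a prefix of the other: lengths {0} vs {1}"
                      .format(len(existing_content), len(content)))
    else:
        report.append("values are identical")
    return "\n".join(report)


print(describe_difference(u"line one\r\nline two", u"line one\nline two"))
```

Printing this report in the branch that rewrites the file shows whether the mismatch comes from the types, from an encoding step, or from something like line endings.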

Answer 1: (score: 0)

I would suggest using the codecs module; something like this:

import codecs
import os

def write_file(filename, content):
    # content is expected to be a unicode string, not UTF-8 bytes
    filename = "{0}{1}.html".format(BASEDIR, filename)
    if os.path.exists(filename):

        # Open the file and decode it from UTF-8 into a unicode string.
        # The with block closes the file as soon as the read is done.
        with codecs.open(filename, "r", "utf-8") as existing_file:
            existing_content = existing_file.read()

        # Compare unicode with unicode; encoding only one side makes
        # every file with non-ASCII characters look changed.
        if existing_content != content:
            print "Content is not equal, writing file to {0}".format(filename)

            # codecs handles the UTF-8 conversion on write, so there is
            # no need to encode 'content' first.
            with codecs.open(filename, "w", "utf-8") as html_file:
                html_file.write(content)

            # If a UTF-8 byte-order mark is required, write it before the
            # content instead of using the block above, and read the file
            # back with the "utf-8-sig" codec so the BOM is not compared:
            #     with open(filename, "wb") as html_file:
            #         html_file.write(codecs.BOM_UTF8)
            #         html_file.write(content.encode("utf-8"))
        else:
            print "Content is equal, nothing is written to {0}".format(filename)
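The underlying idea, decoding on read and encoding on write with the same codec so that only text is ever compared with text, can be checked end to end with a small sketch. It uses io.open (as in the question) rather than codecs.open, and it runs against a temporary directory. The function name `write_if_changed`, the file name and the sample page are assumptions for illustration. Unlike the function in the question, it also writes the file when it does not exist yet.

```python
import io
import os
import tempfile


def write_if_changed(path, content):
    """Write text to path as UTF-8 only when it differs; return True if written."""
    if os.path.exists(path):
        # Decode the existing file so that text is compared with text.
        with io.open(path, encoding="utf-8") as existing_file:
            existing_content = existing_file.read()
        if existing_content == content:
            return False

    # Encode with the same codec that the read side decodes with.
    with io.open(path, "w", encoding="utf-8") as html_file:
        html_file.write(content)
    return True


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "index.html")
page = u"<p>caf\u00e9</p>\n"

first_run = write_if_changed(demo_path, page)   # file is missing, so it is written
second_run = write_if_changed(demo_path, page)  # same content, so nothing is written

print(first_run, second_run)
```

One caveat with this sketch: with the default newline handling of io.open, content that contains "\r\n" is read back as "\n", so such content would still compare as different on every run.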

Lots of good information: How to use utf-8 with python