Question

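I wrote a simple Python script that copies a file from one place to another. (It's for a class, which is why I'm not using something simple like shutil.) At the end I check the hashes of the two files, and it keeps telling me they are different even though the copy succeeded: both are text files that say "hello world".

Here is my code: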

import os

def validity_checker(address1, dest_name):
    try:
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
    except IOError:
        return False
    return True


def copaste(address1, address2):
    # concatenate address2 into filename
    file_ending = address1.split('\\').pop()
    dest_name = address2 + '\\' + file_ending

    # copy file after calling checker
    if validity_checker(address1, dest_name):
        src = open(address1, 'rb')
        dest = open(dest_name, 'wb+')
        contents = src.read()
        dest.write(contents)
        src.close()
        dest.close()
    else:
        print("File name bad. No action taken")

    print src
    print dest
    print(hash(src))  #hash the file not the string
    print(hash(dest))
    return

Output:

<closed file 'C:\\Users\\user\\Downloads\\hello.txt', mode 'rb' at 0x04B7D1D8>
<closed file 'C:\\Users\\user\\Downloads\\dest\\hello.txt', mode 'wb+' at 0x04C2B860>
-2042961099
4991878

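Also, the file does get copied.

I'm fairly sure the hash is checking the file itself rather than the string. Could it have something to do with metadata? Any help would be appreciated.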

Answer 1

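You are using the Python-specific hash() function, which computes the hash values used for dictionary keys and set contents.

For file objects, hash() is based on object identity. It can't be based on anything else, because two distinct file objects are never equal: the fileobject.__eq__ method returns True only when both operands are one and the same object in memory, so it behaves exactly like the is operator. The file contents, file name, mode and any other object attributes play no part in the resulting hash value.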

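From the function documentation:

Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.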
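To make the identity point concrete, here is a minimal sketch, assuming a CPython interpreter and a throwaway temporary file (the file and variable names are illustrative, not from the question). It opens the same path twice and shows that the two file objects are distinct and unequal even though the bytes they read are identical.

```python
import os
import tempfile

# Hypothetical demo file; the name and contents are illustrative only.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello world")

# Two separate file objects opened on the very same path.
with open(path, "rb") as first, open(path, "rb") as second:
    same_object = first is second                  # distinct objects in memory
    equal_objects = first == second                # equality falls back to identity
    same_contents = first.read() == second.read()  # the bytes themselves match
    hash_is_stable = hash(first) == hash(first)    # same object, same hash

os.remove(path)
print(same_object, equal_objects, same_contents, hash_is_stable)
```

Since hash() follows the same identity-based notion of equality, comparing hash(first) with hash(second) says nothing about what is stored on disk.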

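If you need to verify that a copy of a file contains the same data, you need to hash the file contents with a cryptographic hash function, which is something entirely different. Use the hashlib module; for your use case, the simple and fast MD5 algorithm will do: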

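import hashlib

# src and dest are the (already closed) file objects from the question;
# only their .name attribute is used to reopen the files.
for closed_file in (src, dest):
    with open(closed_file.name, 'rb') as reopened:  # opened in binary mode!
        print(reopened.name)
        print(hashlib.md5(reopened.read()).hexdigest())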

If the binary contents of the two files are exactly the same, their cryptographic hashes will be the same as well.
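Reading the whole file with read() is fine for a small text file. For larger files, a possible variation is to feed the hash object in chunks. The sketch below assumes a hypothetical helper named md5_of_file and demo files in a temporary directory; none of these names come from the question.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:  # binary mode, so bytes are hashed as stored
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical demo files in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "hello.txt")
copy_path = os.path.join(workdir, "hello_copy.txt")
other_path = os.path.join(workdir, "other.txt")

for path, data in ((src_path, b"hello world"),
                   (copy_path, b"hello world"),
                   (other_path, b"hello there")):
    with open(path, "wb") as handle:
        handle.write(data)

src_digest = md5_of_file(src_path)
copy_matches = src_digest == md5_of_file(copy_path)    # identical bytes
other_matches = src_digest == md5_of_file(other_path)  # different bytes

shutil.rmtree(workdir)  # clean up the demo files
print(copy_matches, other_matches)
```

In the question's copaste function, the same helper could be called with address1 and dest_name once both files have been closed.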

Answer 2

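You are getting the Python hash of the file object, not of the file's contents. At the very least you should do:

print(hash(open(address1, 'rb').read()))
print(hash(open(dest_name, 'rb').read()))

But since that still carries a risk of collisions, you should do as Martijn suggested and use a hashlib function.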

Python hash values differ for identical files
