Question

I want to calculate the CRC of a file and get output like this: E45A12AC. Here's my code:

#!/usr/bin/env python 
import os, sys
import zlib

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        zlib.crc32(eachLine)

for eachFile in sys.argv[1:]:
    crc(eachFile)

This calculates the CRC of each line, but its output (e.g. -1767935985) is not what I want.

Hashlib works the way I want, but it computes the md5:

import hashlib
m = hashlib.md5()
for line in open('data.txt', 'rb'):
    m.update(line)
print m.hexdigest()

Is it possible to get something similar using zlib.crc32?

Answer 1

A more compact and optimized version of the code:

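import zlib

def crc(fileName):
    prev = 0
    # Feed each line into a running CRC-32; "with" closes the file afterwards.
    with open(fileName, "rb") as fd:
        for eachLine in fd:
            prev = zlib.crc32(eachLine, prev)
    # Mask to 32 bits so the result is unsigned on Python 2 as well.
    return "%X" % (prev & 0xFFFFFFFF)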

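PS2: The old PS is deprecated, and therefore deleted, because of the suggestion in the comments. Thank you. I don't understand how I missed this, but it was really good.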
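To make the running-checksum idea concrete, here is a minimal sketch. The helper name crc32_of_parts is made up for illustration; the only library call used is zlib.crc32 with its optional second argument (the checksum so far). It checks that feeding the data piece by piece gives the same result as checksumming everything at once.

```python
import zlib

def crc32_of_parts(parts):
    """Feed each bytes chunk into a running CRC-32 (hypothetical helper)."""
    checksum = 0
    for part in parts:
        # The second argument is the checksum computed so far.
        checksum = zlib.crc32(part, checksum)
    return checksum & 0xFFFFFFFF

lines = [b"first line\n", b"second line\n"]
# Piecewise and one-shot checksums agree.
assert crc32_of_parts(lines) == zlib.crc32(b"".join(lines)) & 0xFFFFFFFF
```

This is why it does not matter whether the file is split by lines or by fixed-size blocks.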

Answer 2

A hashlib-compatible interface for CRC-32 support:

import zlib

class crc32(object):
    name = 'crc32'
    digest_size = 4
    block_size = 1

    def __init__(self, arg=b''):  # bytes default, so this also works on Python 3
        self.__digest = 0
        self.update(arg)

    def copy(self):
        copy = super(self.__class__, self).__new__(self.__class__)
        copy.__digest = self.__digest
        return copy

    def digest(self):
        return self.__digest

    def hexdigest(self):
        return '{:08x}'.format(self.__digest)

    def update(self, arg):
        self.__digest = zlib.crc32(arg, self.__digest) & 0xffffffff

# Now you can define hashlib.crc32 = crc32
import hashlib
hashlib.crc32 = crc32

# Python 2.7: hashlib.algorithms += ('crc32',)
# Python >= 3.2: hashlib.algorithms_available.add('crc32')
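The point of a hashlib-style interface is that one file-reading loop can serve any algorithm. A rough sketch of that, assuming only that the object has update() and hexdigest(); hash_file is a name I made up, and it is shown here with hashlib.md5 so the block runs on its own:

```python
import hashlib

def hash_file(path, hasher, chunksize=65536):
    """Feed a file into any object with update() and hexdigest() (hypothetical helper)."""
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Usage would look like hash_file('data.txt', hashlib.md5()). An instance of the crc32 class above should be usable in the same place, since it exposes the same two methods.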

Answer 3

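To show any integer's lowest 32 bits as 8 hexadecimal digits, without a sign, you can "mask" the value by bitwise-ANDing it with a mask made of 32 bits that are all 1, and then apply formatting. That is: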

>>> x = -1767935985
>>> format(x & 0xFFFFFFFF, '08x')
'969f700f'

So it is irrelevant whether the integer you are formatting comes from zlib.crc32 or from any other computation.
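A small sketch that wraps the masking in a function (to_unsigned_hex is a hypothetical name) and applies it both to the negative value from the question and to a zlib.crc32 result. Python 2 returns a signed value from zlib.crc32 while Python 3 always returns an unsigned one, so the mask is harmless on Python 3 and necessary on Python 2.

```python
import zlib

def to_unsigned_hex(value):
    """Format the lowest 32 bits of an integer as 8 hex digits (hypothetical helper)."""
    return format(value & 0xFFFFFFFF, '08X')

print(to_unsigned_hex(-1767935985))                # 969F700F
# b"123456789" is the usual CRC-32 check input; its checksum is CBF43926.
print(to_unsigned_hex(zlib.crc32(b"123456789")))   # CBF43926
```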

Answer 4

Merging the above two pieces of code gives the following:

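import logging
import zlib

def crc(decompressedFile):  # wrapper added so that "return" is valid; the name is assumed
    try:
        fd = open(decompressedFile, "rb")
    except IOError:
        logging.error("Unable to open the file in readmode:" + decompressedFile)
        return 4
    eachLine = fd.readline()
    prev = 0
    while eachLine:
        prev = zlib.crc32(eachLine, prev)
        eachLine = fd.readline()
    fd.close()
    # Assumed ending: the original fragment stops here without using prev.
    return format(prev & 0xFFFFFFFF, '08x')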

Answer 5

A modified version of kobor42's answer, with performance improved by 2-3 times by reading fixed-size chunks instead of "lines":

def crc32(fileName):
    fh = open(fileName, 'rb')
    hash = 0
    while True:
        s = fh.read(65536)
        if not s:
            break
        hash = zlib.crc32(s, hash)
    fh.close()
    return "%08X" % (hash & 0xFFFFFFFF)

It also includes leading zeroes in the returned string.
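To illustrate the leading-zero difference, a tiny example with an arbitrary value (not a real file's checksum) whose top hex digit is zero:

```python
value = 0x0A1B2C3D  # arbitrary example value, chosen so the first hex digit is 0

short = "%X" % value      # no padding: the leading zero is dropped
padded = "%08X" % value   # zero-padded to 8 hex digits

print(short)   # A1B2C3D
print(padded)  # 0A1B2C3D
```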

Answer 6

Python 3.8+ (using the walrus operator):

import zlib

def crc32(filename, chunksize=65536):
    """Compute the CRC-32 checksum of the contents of the given filename"""
    with open(filename, "rb") as f:
        checksum = 0
        while chunk := f.read(chunksize):
            checksum = zlib.crc32(chunk, checksum)
        return checksum

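chunksize is how many bytes you read from the file at a time. No matter what you set it to, you will get the same hash for the same file (setting it too low might make your code slow, while setting it too high might use too much memory).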

The result is a 32-bit integer. The CRC-32 checksum of an empty file is 0.
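If you want to convince yourself that the chunk size does not affect the result, something like the following sketch works. It writes a throwaway temp file and compares several chunk sizes; crc32_chunked is a hypothetical helper written without the walrus operator so it also runs on older Python 3 versions.

```python
import os
import tempfile
import zlib

def crc32_chunked(filename, chunksize):
    """Same idea as above, written with a plain while loop (hypothetical helper)."""
    checksum = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            checksum = zlib.crc32(chunk, checksum)
    return checksum

# Write a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"some example bytes\n" * 1000)
try:
    checksums = {crc32_chunked(path, size) for size in (1, 7, 4096, 65536)}
finally:
    os.remove(path)

assert len(checksums) == 1  # every chunk size gave the same checksum
print(format(next(iter(checksums)), "08X"))  # as 8 hex digits, like in the question
```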

Answer 7

A modified and more compact version of CrouZ's answer, with slightly improved performance, using a for loop and file buffering:

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

Results, on a 6700k with an SSD:

(Note: retested multiple times, and it was still faster.)

Warming up the machine...
Finished.

Beginning tests...
File size: 77966KB
Test cycles: 500

With for loop and buffer.
Result 39.64133464173549 

CrouZ solution
Result 39.76574074476219 

kobor42 solution
Result 91.6181196155832

Tested in Python 3.6 x64 using the script below:

import os, timeit, zlib, random

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

def crc32(fileName):
    """CrouZ solution"""
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

def crc(fileName):
    """kobor42 solution"""
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

fpath = r'D:\test\test.dat'
tests = {forLoopCrc: 'With for loop and buffer.', 
     crc32: 'CrouZ solution', crc: 'kobor42 solution'}
count = 500

# CPU, HDD warmup
randomItm = [x for x in tests.keys()]
random.shuffle(randomItm)
print('\nWarming up the machine...')
for c in range(count):
    randomItm[0](fpath)
print('Finished.\n')

# Begin test
print('Beginning tests...\nFile size: %dKB\nTest cycles: %d\n' % (
    os.stat(fpath).st_size/1024, count))
for x in tests:
    print(tests[x])
    start_time = timeit.default_timer()
    for c in range(count):
        x(fpath)
    print('Result', timeit.default_timer() - start_time, '\n')

It is faster because for loops are faster than while loops (sources: here and here).
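As a side note, a for loop over chunks can also be written without computing the number of iterations from os.stat, by using the two-argument form of iter() with an empty-bytes sentinel. This is only a sketch (iterCrc is a name I made up), and I have not timed it against the versions above, so no performance claim is implied.

```python
import zlib
from functools import partial

def iterCrc(fpath, chunksize=65536):
    """For loop over fixed-size chunks, ending at EOF (hypothetical variant)."""
    crc = 0
    with open(fpath, 'rb') as ins:
        # iter(callable, sentinel) keeps calling ins.read(chunksize)
        # until it returns b'', which happens at end of file.
        for chunk in iter(partial(ins.read, chunksize), b''):
            crc = zlib.crc32(chunk, crc)
    return '%08X' % (crc & 0xFFFFFFFF)
```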

Answer 8

You can use base64 to get output like [ERD45FTR]. Also, zlib.crc32 provides an update option.

import os, sys
import zlib
import base64

def crc(fileName):
    with open(fileName, "rb") as fd:
        content = fd.readlines()
    prev = 0  # zlib.crc32's default starting value
    for eachLine in content:
        # pass the previous checksum to update it with the next line
        prev = zlib.crc32(eachLine, prev)
    return prev

for eachFile in sys.argv[1:]:
    # str() yields text, so encode it to bytes before base64-encoding
    print(base64.b64encode(str(crc(eachFile)).encode("ascii")))
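Note that the code above base64-encodes the decimal text of the checksum. If the goal is a short base64 string, one possible alternative (my assumption about the intent, not something stated above) is to encode the 4 raw checksum bytes instead. crc32_base64 is a hypothetical name; struct.pack(">I", ...) packs the value as a big-endian unsigned 32-bit integer.

```python
import base64
import struct
import zlib

def crc32_base64(data):
    """Base64 of the 4 raw CRC-32 bytes, big-endian (hypothetical helper)."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    packed = struct.pack(">I", crc)  # 4 bytes, most significant first
    return base64.b64encode(packed).decode("ascii")

print(crc32_base64(b"123456789"))
```

Four bytes always encode to an 8-character base64 string, so the output length is fixed.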

Answer 9

Solution:

import os, sys
import zlib

def crc(fileName, excludeLine="", includeLine=""):
    try:
        fd = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in readmode:", fileName)
        return
    prev = 0
    for eachLine in fd:
        # Skip lines that start with excludeLine (pass bytes, since the
        # file is opened in binary mode). includeLine is unused.
        if excludeLine and eachLine.startswith(excludeLine):
            continue
        prev = zlib.crc32(eachLine, prev)
    fd.close()
    return format(prev & 0xFFFFFFFF, '08x')  # returns 8 digits crc

for eachFile in sys.argv[1:]:
    print(crc(eachFile))

I don't really know what (excludeLine="", includeLine="") is for...

Answer 10

There is a faster and more compact way to compute the CRC, using binascii:

import binascii

def crc32(filename):
    with open(filename, 'rb') as f:  # read the whole file, then close it
        buf = f.read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash
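For what it's worth, binascii.crc32 and zlib.crc32 compute the same CRC-32, and binascii.crc32 also accepts a running value as its second argument, so it can be used chunk by chunk for files that are too large to read in one go. A quick sketch of both points:

```python
import binascii
import zlib

data = b"123456789"
# Both modules produce the same checksum for the same bytes.
same = binascii.crc32(data) == zlib.crc32(data)

# Incremental use: pass the previous result as the starting value.
running = 0
for part in (b"1234", b"56789"):
    running = binascii.crc32(part, running)

print(same)
print("%08X" % (running & 0xFFFFFFFF))  # CBF43926
```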

Calculating the CRC of a file in Python