How do I skip unreadable (corrupted) files when MD5 fingerprinting?

Time: 2017-11-22 10:00:15

Tags: python python-3.x

The code below generates MD5/metadata fingerprints, but it crashes on certain problem files (for example, files that can be copied, and mostly even opened, but that cannot be hashed or zipped up [masking their corruption]).

Question: how can I make this code skip or ignore any and all problem files and finish the rest of the work? Imagine 1 million files across 8 TB. Otherwise, if I let it keep running with no live progress monitoring, I may find two days later that nothing got done, because some problem file caused the code to hang.

Partial code (see the full code below):

def createBasicInfoListFromDisk():

    global diskCompareListDetails, onlyFileNameOnDisk, driveLetter,walk_dir

    walk_dir = os.path.abspath(walk_dir)
    for root, subdirs, files in os.walk(walk_dir, topdown=True, onerror=None, followlinks=True ):
        for filename in files:
            file_path = os.path.join(root, filename)
            temp = file_path.split(":")
            driveLetter = temp[0]
            filePathWithoutDriveLetter = temp[1]
            fileSize = os.path.getsize(file_path)
            mod_on =  get_last_write_time(file_path)
            print('\t- file %s (full path: %s)' % (filename, file_path))
            print('FileName : {filename} is of size {size} and was modified on{mdt}'.format(filename=file_path,size=fileSize,mdt=mod_on ))

            diskCompareListDetails.append("\"" + filePathWithoutDriveLetter+"\",\""+str(fileSize) + "\",\"" + mod_on +'"')
            onlyFileNameOnDisk.append("\""+filePathWithoutDriveLetter+"\"")

    return

Error:

FileName : T:\problemtest\problemfile.doc is of size 27136 and was modified on2010-10-10 13:58:32
Traceback (most recent call last):
  File "t:\scripts\test.py", line 196, in <module>
    createBasicInfoListFromDisk()
  File "t:\scripts\test.py", line 76, in createBasicInfoListFromDisk
    mod_on =  get_last_write_time(file_path)
  File "t:\scripts\test.py", line 61, in get_last_write_time
    convert_time_to_human_readable = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
OSError: [Errno 22] Invalid argument

Full code:

import os
import sys
import time
import datetime
import difflib
import decimal
import hashlib
from pip._vendor.distlib.compat import raw_input

csvListDetails = list()
csvCompareListDetails = list()
diskCompareListDetails = list()
onlyFileNameOnDisk = list()
addedFiles = list()
removedFiles = list()
driveLetter =""
finalFilesToChange=list()
finalFilesToDelete=list()
changedFiles=list()
csvfilewithPath="md5.csv"
import shutil
walk_dir=""

def findAndReadCSVFile(fileName):

    global csvListDetails 
    global csvCompareListDetails
    haveIgnoredLine = 0
    foundFile=0

    try :
        inputFileHandler = open(fileName,"rt",encoding='utf-8')
        update_time = get_last_write_time(fileName)
        print("\n   Found md5.csv, last updated on: %s" % update_time)
        foundFile=1

    except (OSError, IOError, FileNotFoundError):
        print("\n   md5.csv not found. Will create a new one.")
        return foundFile

    for line in inputFileHandler:
        if (haveIgnoredLine==0):
            haveIgnoredLine=1
            continue

        rowItem = line.replace("\n","").split('","')
        csvCompareListDetails.append('"' + rowItem[3]+',"'+rowItem[2]+'","' +rowItem[1]+'"')
        lineDetails = list()

        for detailNum in range (0,len(rowItem)):
            lineDetails.append('"' + (rowItem[detailNum].replace('"','')) + '"')

        csvListDetails.append(lineDetails)

    inputFileHandler.close()

    return foundFile

def get_last_write_time(filename):
    st = os.stat(filename)
    convert_time_to_human_readable = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
    return convert_time_to_human_readable

def createBasicInfoListFromDisk():

    global diskCompareListDetails, onlyFileNameOnDisk, driveLetter,walk_dir

    walk_dir = os.path.abspath(walk_dir)
    for root, subdirs, files in os.walk(walk_dir, topdown=True, onerror=None, followlinks=True ):
        for filename in files:
            file_path = os.path.join(root, filename)
            temp = file_path.split(":")
            driveLetter = temp[0]
            filePathWithoutDriveLetter = temp[1]
            fileSize = os.path.getsize(file_path)
            mod_on =  get_last_write_time(file_path)
            print('\t- file %s (full path: %s)' % (filename, file_path))
            print('FileName : {filename} is of size {size} and was modified on{mdt}'.format(filename=file_path,size=fileSize,mdt=mod_on ))

            diskCompareListDetails.append("\"" + filePathWithoutDriveLetter+"\",\""+str(fileSize) + "\",\"" + mod_on +'"')
            onlyFileNameOnDisk.append("\""+filePathWithoutDriveLetter+"\"")

    return

def compareLogAndDiskLists():
    global addedFiles, removedFiles

    diff = difflib.unified_diff(csvCompareListDetails, diskCompareListDetails, fromfile='file1', tofile='file2', lineterm='', n=0)
    lines = list(diff)[2:]
    addedFiles = [line[1:] for line in lines if line[0] == '+']
    removedFiles = [line[1:] for line in lines if line[0] == '-']

    return

def displayInfoForUserInput():
    global finalFilesToChange, finalFilesToDelete

    changedOrNewFileCount = 0
    noLongerExistingFilesCount = 0
    totalSizeOfChange = 0

    for line in addedFiles:
        if line not in removedFiles:

            changedOrNewFileCount = changedOrNewFileCount +1

            elements =  line.replace("\n","").split('","')
            sizeOfFile= int(elements[1].replace('"',''))
            totalSizeOfChange = totalSizeOfChange + sizeOfFile
            finalFilesToChange.append(elements[0] +'"')

    for line in removedFiles:

        elements = line.split('","')
        if elements[0]+'"' not in onlyFileNameOnDisk:
            noLongerExistingFilesCount = noLongerExistingFilesCount + 1
            finalFilesToDelete.append(elements[0]+'"')

    GBModSz= decimal.Decimal(totalSizeOfChange) / decimal.Decimal('1073741824')
    print("\n   New or modified files on drive: {} (need to hash)".format(changedOrNewFileCount))
    print ("   Obsolete lines in md5.csv (files modified or not on drive): {} (lines to delete)".format(noLongerExistingFilesCount))
    print ("   {} files ({:.2f} GB) needs to be hashed.".format(changedOrNewFileCount,GBModSz))

    userInput = raw_input("\n   Proceed with hash? (Y/N, Yes/No) ")

    if (userInput.strip().upper() == "Y" or userInput.strip().upper() == "YES"):
        print("Continuing Processing...")
    else:
        print("You opted not to continue, Exiting")
        sys.exit()

    return

def processFiles(foundFile):

    if (foundFile==1):
        oldFileName = walk_dir+"/md5.csv"
        shutil.copy( oldFileName, getTargetFileName(oldFileName))

    BLOCKSIZE = 1048576*4
    global changedFiles
    for fileToHash in finalFilesToChange:
        hasher = hashlib.new('md5')
        fileToUse=driveLetter+":"+fileToHash.replace('"','')
        with open(fileToUse, 'rb') as afile:
            buf = afile.read(BLOCKSIZE)
            while len(buf) > 0:
                hasher.update(buf)
                buf = afile.read(BLOCKSIZE)

        fileDetails = list()
        fileDetails.append(hasher.hexdigest())
        fileDetails.append(get_last_write_time(fileToUse))
        fileDetails.append(os.path.getsize(fileToUse))
        fileDetails.append(fileToHash)
        changedFiles.append(fileDetails)

    return 

def getTargetFileName(oldFileName):
    targetFileName= walk_dir+"/generated_on_" + get_last_write_time(oldFileName).replace(" ","_").replace("-","").replace(":","")
    targetFileName = targetFileName + "__archived_on_" + datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    targetFileName = targetFileName + "__md5.csv"
    return targetFileName


def writeCSVFile(fileName):
    try :
        outputFileHandler=open(fileName,"wt",encoding='utf-8')
        outputFileHandler.write("\"md5Hash\",\"LastWriteTime\",\"Length\",\"FullName\"\n")
        for details in csvListDetails:
            if details[3] in finalFilesToDelete:
                continue
            if details[3] in finalFilesToChange:
                continue
            outputFileHandler.write("{},{},{},{}\n".format(details[0],details[1],details[2],details[3]))

        for details in changedFiles:
            outputFileHandler.write("\"{}\",\"{}\",\"{}\",{}\n".format(details[0],details[1],details[2],details[3]))
        outputFileHandler.close()

    except (OSError, IOError, FileNotFoundError) as e:
        print("ERROR :")
        print("File {} is either not writable or some other error: {}".format(fileName,e))

    return

if __name__ == '__main__':

    walk_dir = raw_input("\n   Enter drive or directory to scan: ")
    csvfilewithPath=walk_dir+"/md5.csv"
    print("\n   Drive to scan: " + walk_dir)   

    foundFile = 0
    foundFile=findAndReadCSVFile(csvfilewithPath)
    createBasicInfoListFromDisk()
    compareLogAndDiskLists()
    displayInfoForUserInput()
    processFiles(foundFile)
    writeCSVFile(csvfilewithPath)

I tried this fix, with no luck:

def get_last_write_time(filename):
    try:
        st = os.stat(filename)
        convert_time_to_human_readable = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
        return convert_time_to_human_readable
    except OSError:
        pass
    return "ERROR"

def createBasicInfoListFromDisk():

2 Answers:

Answer 0 (score: 1)

I agree with IMCoins, and I can quite see why the except did not catch the error.

The first thing I would do is go to the piece of code that raises the OSError and try to catch it explicitly.

def get_last_write_time(filename):
    try:
        st = os.stat(filename)
        convert_time_to_human_readable = time.strftime("%Y-%m-%d %H:%M:%S",
                                                       time.localtime(st.st_mtime))
        return convert_time_to_human_readable
    except OSError:
        pass
    return "ERROR"  # or whatever string you want to add
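
With that change, get_last_write_time no longer raises, but the calling loop still has to decide what to do with the sentinel value. A minimal sketch of how the os.walk loop from the question could then skip such files, assuming the rest of the script (imports, walk_dir) is unchanged; the "ERROR" check and the continue are illustrative additions, not part of this answer:

for root, subdirs, files in os.walk(walk_dir, topdown=True, onerror=None, followlinks=True):
    for filename in files:
        file_path = os.path.join(root, filename)
        mod_on = get_last_write_time(file_path)
        if mod_on == "ERROR":
            # Metadata could not be read; report the file and move on.
            print("Skipping problem file: %s" % file_path)
            continue
        # ... the rest of the per-file work from createBasicInfoListFromDisk ...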

Answer 1 (score: 0)

Answer updated for the updated post.

As mentioned before, an except statement with the exception type specified catches everything of that type. So, to do what you want to do... I am afraid the possible answers are:

  • Create a way to identify corrupted files and handle them properly.

  • Wrap every part of the code where an error might occur in its own try, except statement.

Let me warn you about the second solution, though: there are sometimes system errors that you do not want to silence. I believe you should print the exceptions you catch, so you can find out what other problems you may run into.
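
As a sketch of that second option, loosely based on the asker's processFiles loop but with simplified, hypothetical names (this is not code from the original answer): each file's hashing is wrapped in its own try/except, so a failure is printed and the file skipped instead of crashing the run.

import hashlib

BLOCKSIZE = 1048576 * 4

def hash_files(paths):
    # Returns {path: md5 hex digest}, skipping any file that cannot be read.
    results = {}
    for path in paths:
        try:
            hasher = hashlib.md5()
            with open(path, 'rb') as afile:
                buf = afile.read(BLOCKSIZE)
                while len(buf) > 0:
                    hasher.update(buf)
                    buf = afile.read(BLOCKSIZE)
            results[path] = hasher.hexdigest()
        except OSError as e:
            # Print the exception, as suggested above, then keep going.
            print("Could not hash {}: {}".format(path, e))
    return results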

Also, just so you know (because you may not): your error is not inside the try, except statement. Your error is (if I copied and pasted correctly into my editor) at line 196, createBasicInfoListFromDisk(), and then at line 76, mod_on = get_last_write_time(file_path).

Also, since you mentioned that you are using Python 3.x, I suggest you have a look at the suppress function (https://docs.python.org/3/library/contextlib.html#contextlib.suppress).
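
For instance (a small sketch, not from the original answer; the file names are hypothetical), suppress silences a chosen exception type for a block, which can read more cleanly than try/except/pass:

import os
from contextlib import suppress

for file_path in ["good.txt", "broken.doc"]:  # hypothetical paths
    with suppress(OSError):
        # If anything in this block raises OSError, the block is abandoned
        # and the loop simply moves on to the next file.
        print(file_path, os.path.getsize(file_path))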

I hope this helps.