Checking whether an MD5 value exists in an index file

Date: 2018-11-26 19:37:42

Tags: python md5 scanning

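I'm trying to find a way for my code to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.

My code is below.

Each formed URL is converted to an MD5 string and, once the scan completes, stored in the idx file, so that future scans do not fetch the same URL again. The problem I see is that the if str(md5url) in line branch never executes, possibly because I did not append '\n' as a suffix when adding the hash to the file. But I tried that and it still doesn't work.

Any ideas?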

import hashlib


def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()


def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')

for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as  " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)

print("Closing..")
afile.close()

1 Answer:

Answer 0 (score: 1)

You are testing inside the loop. For every line that does not match, you download:

line1
    if hash in line:
        print something
    else:
        download
line2
    if hash in line:
        print something
    else:
        download
line3
    if hash in line:
        print something
    else:
        download

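If the hash is on line 1, you will still download, because the hash is not on line 2 or line 3. You should not decide to download until all lines have been tested.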
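
One way to express "decide only after every line has been tested" is to reduce the whole file to a single yes/no answer before acting on it. The following is a minimal sketch, not the asker's code: already_indexed is a hypothetical helper, and the sample hashes are made-up placeholder strings.

```python
def already_indexed(md5url, lines):
    """Return True if md5url equals any line once line endings are stripped."""
    # any() checks the lines one by one, stops at the first match,
    # and produces a single answer for the whole list
    return any(md5url == line.strip() for line in lines)


# placeholder index contents (hypothetical values, one hash per line)
lines = ["aaa111\n", "bbb222\n", "ccc333\n"]

for candidate in ("aaa111", "ddd444"):
    if already_indexed(candidate, lines):
        print("Skipping " + candidate)
    else:
        # reached only when no line matched, so it runs at most once per candidate
        print("Would download " + candidate)
```

With the decision made once per URL rather than once per line, a match on line 1 can no longer be overridden by the non-matching lines after it.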


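The best way to do this is to read all the hashes into a set object in one go (because containment tests against a set are faster), stripping the line separators as you read: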

try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()

Then, when testing a new hash, use:

urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
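
Putting the two pieces together, a complete run might look like the sketch below. It rests on several assumptions: compute_md5_hash, load_hashes and scan are hypothetical names, the download argument stands in for the question's downloadengine, the example.com URLs are placeholders, and a temporary directory is used so the sketch does not touch a real urlindex.idx.

```python
import hashlib
import os
import tempfile


def compute_md5_hash(text):
    """Return the hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def load_hashes(index_path):
    """Read every stored hash into a set; a missing file gives an empty set."""
    try:
        with open(index_path) as hashfile:
            return {line.strip() for line in hashfile if line.strip()}
    except IOError:
        return set()


def scan(urls, index_path, download):
    """Download each URL not yet indexed and record its hash afterwards."""
    hashes = load_hashes(index_path)  # read the index once, before the loop
    downloaded = []
    for url in urls:
        urlhash = compute_md5_hash(url)
        if urlhash in hashes:
            continue  # already scanned and indexed
        if download(url):  # hypothetical stand-in for downloadengine()
            hashes.add(urlhash)
            with open(index_path, "a") as hashfile:
                hashfile.write(urlhash + "\n")  # one hash per line
            downloaded.append(url)
    return downloaded


# demonstration against a throwaway index file
index_path = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
urls = ["http://example.com/a", "http://example.com/b"]

first_run = scan(urls, index_path, download=lambda url: True)
second_run = scan(urls, index_path, download=lambda url: True)
print(first_run)   # both URLs are new on the first pass
print(second_run)  # nothing left to fetch on the second pass
```

Reading the index once before the loop also avoids calling readlines() repeatedly on the same open file object, which returns an empty list on every call after the first.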