Question

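I am trying to find a way to have my code cross-check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.

Below is my code.

Each formed URL is converted to an MD5 string and stored in the idx file once the scan completes; the goal is that future scans should not fetch the same URL again. The problem I see is that the if str(md5url) in line branch never gets executed, possibly because I did not append '\n' as a suffix when adding the hash to the file. But I tried that and it still does not work.

Any ideas?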

def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()


def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')

for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as  " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)

print("Closing..")
afile.close()

Answer 1

You are testing inside the loop, so you download once for every line that does not match:

line1
    if hash in line:
        print something
    else
        download
line2
    if hash in line:
        print something
    else
        download
line3
    if hash in line:
        print something
    else
        download

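If the hash is in line 1, you will still download, because the hash is not in line 2 or line 3. You should not decide to download until all lines have been tested.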
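
One way to express "decide only after every line has been tested" is to fold the per-line checks into a single boolean before acting. The following is a minimal sketch, not the asker's code: already_indexed is a hypothetical helper name, and it assumes the index file holds one hash per line.

```python
def already_indexed(md5url, hashlist):
    """Return True if md5url equals any line read from the index file.

    already_indexed is a hypothetical helper, not part of the original code.
    """
    # strip() drops the trailing newline so the comparison is exact
    return any(md5url == line.strip() for line in hashlist)


# Sample lines as readlines() would return them (assumed one hash per line)
hashlist = ["aaa\n", "bbb\n", "ccc\n"]

print(already_indexed("aaa", hashlist))
print(already_indexed("zzz", hashlist))
```

The download decision is then made once per URL, outside any loop over the lines.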

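The best way to do this is to read all the hashes into a set object in one go (because membership tests against a set are faster), stripping the line separators as you read: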

try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()

Then, when testing a new hash, use:

urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
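
Put together, the two pieces might look like the sketch below. It is only an illustration under stated assumptions: load_hashes and scan are hypothetical names, the download argument stands in for the asker's downloadengine, the example.com URLs are placeholders, and the index file is created in a temporary directory so the demo can run anywhere.

```python
import hashlib
import os
import tempfile


def compute_md5(text):
    """MD5 hex digest of a string, as in the question's computeMD5hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def load_hashes(fn):
    """Read every stored hash into a set; an absent file gives an empty set."""
    try:
        with open(fn) as hashfile:
            # skip blank lines and drop the trailing newline of each hash
            return {line.strip() for line in hashfile if line.strip()}
    except IOError:
        return set()


def scan(urls, fn, download):
    """Download each URL not yet indexed; return the URLs that were fetched.

    `download` is a stand-in for the asker's downloadengine (an assumption).
    """
    hashes = load_hashes(fn)
    fetched = []
    for url in urls:
        urlhash = compute_md5(url)
        if urlhash in hashes:
            continue  # already scanned and indexed, skip it
        if download(url):
            hashes.add(urlhash)
            with open(fn, "a") as hashfile:
                hashfile.write(urlhash + "\n")  # one hash per line
            fetched.append(url)
    return fetched


# Demo with a throwaway index file and placeholder URLs
index_path = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
urls = ["http://example.com/a", "http://example.com/b"]
first = scan(urls, index_path, download=lambda url: True)
second = scan(urls, index_path, download=lambda url: True)
print(first)
print(second)
```

On the second call the index already holds both hashes, so nothing is downloaded again.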

Check whether an MD5 value exists in the index file
