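I'm trying to find a way for my code to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.
My code is below.
Each URL that is formed is converted to an MD5 string and stored in the idx file once the scan completes, so that future scans don't fetch the same URL again. The problem I'm seeing is that the if str(md5url) in line
branch is never executed, possibly because I don't append '\n' as a suffix when adding the hash to the file. But I tried that and it still doesn't work.
Any ideas?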
import hashlib

def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()

def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

# files, repouri and downloadengine() come from the rest of the script (not shown)
fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')
for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)
print("Closing..")
afile.close()
Answer 0 (score: 1)
You are testing inside the loop, so you download once for every line that doesn't match:
line1
    if hash in line:
        print something
    else:
        download
line2
    if hash in line:
        print something
    else:
        download
line3
    if hash in line:
        print something
    else:
        download
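If the hash is on line 1, you will still download, because the hash is not on line 2 or line 3. You shouldn't decide to download until all lines have been tested.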
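To illustrate deciding only once for the whole file, here is a minimal sketch. It assumes the index lines have already been read into a list; should_download is a hypothetical helper name and the sample hashes are made up.

```python
def should_download(md5url, hashlist):
    """Return True only if md5url matches none of the indexed lines."""
    # any() gives a single yes/no answer for the whole list, instead of
    # one decision per line as an if/else inside the loop would.
    return not any(md5url == line.strip() for line in hashlist)


hashlist = ["aaa\n", "bbb\n", "ccc\n"]  # made-up index contents
print(should_download("aaa", hashlist))  # False: already indexed
print(should_download("ddd", hashlist))  # True: not found on any line
```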
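The best way to do this is to read all the hashes into a set object in one go (membership tests against a set are faster), stripping the line separators as you read: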
try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()
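Stripping matters because a set test is an exact match, not a substring match like "in line" on a string. A small sketch with made-up lines (raw_lines and unstripped are names chosen for illustration):

```python
raw_lines = ["aaa\n", "bbb\n"]  # made-up lines as read from the index file

unstripped = set(raw_lines)                    # entries still end in '\n'
hashes = {line.strip() for line in raw_lines}  # newline removed

print("aaa" in unstripped)  # False: 'aaa' != 'aaa\n'
print("aaa" in hashes)      # True
```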
Then, when testing a new hash, use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
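Putting the pieces together, a self-contained sketch might look like the following. The scan function, the fake_download stand-in for the question's downloadengine(), the example.com URLs, and the temporary index path are all assumptions made so the example can run on its own; recording the hash only after a successful download is one possible choice, not something the question specifies.

```python
import hashlib
import os
import tempfile


def computeMD5hash(string_for_hash):
    return hashlib.md5(string_for_hash.encode('utf-8')).hexdigest()


def scan(urls, fn, download):
    """Download each URL not yet in the index file; return the URLs fetched."""
    try:
        with open(fn) as hashfile:
            hashes = {line.strip() for line in hashfile}
    except IOError:
        hashes = set()  # no index file yet

    fetched = []
    for url in urls:
        urlhash = computeMD5hash(url)
        if urlhash in hashes:
            continue  # already indexed, skip it
        if download(url):
            # record the hash only after a successful download
            hashes.add(urlhash)
            with open(fn, 'a') as hashfile:
                hashfile.write(urlhash + '\n')
            fetched.append(url)
    return fetched


def fake_download(url):
    # hypothetical stand-in for downloadengine(); always reports success
    return True


fn = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
urls = ["http://example.com/a", "http://example.com/b"]
first = scan(urls, fn, fake_download)   # nothing indexed yet
second = scan(urls, fn, fake_download)  # everything already indexed
print(first)
print(second)
```

On the first call both URLs are fetched and indexed; on the second call the set lookup finds both hashes and nothing is downloaded again.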