Question

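Is there a way to use a hash function (or something similar) to generate a unique position in a file, so that I can easily retrieve some value corresponding to a string from that position: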

>>> hash('abs')
-1600925517
>>> hash('cv')
-1537434339
>>> hash(112)
112
>>> hash('ANNC')
258026172
>>> hash('annc')
1415313084
>>> hash('an')
-1549758577
>>> hash('anc')
-1588925561
>>> hash('abs')
-1600925517

So something like:

def hash_location(string):
    # location is the offset I would like to derive from string
    return location

open_file = open(file_path, 'r+')
our_string = 'something'
location = hash_location(our_string)
open_file.seek(location)
open_file.write(our_string)
open_file.close()

That way the hash value would map to some "positive" position in the file, one that I can compute given only the string.

Answer 1

from random import random
from hashlib import sha1

file_ext = ".jpg"
# sha1 needs bytes, so encode the random number's string form first
unique_filename = sha1(str(random()).encode()).hexdigest() + file_ext

Answer 2

No - hash on my platform returns at least a 64-bit number, so even if you only stored strings 1 byte long, you would still need 2 ** 64 bytes = 16 exbibytes of disk space.
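The arithmetic behind that figure can be checked with a few lines. This is only a sketch of the calculation, assuming one byte of storage per possible 64-bit offset; the names BITS and EXBIBYTE are made up for the illustration.

```python
# Assumption: every possible 64-bit hash value is a byte offset into one file.
BITS = 64
EXBIBYTE = 2 ** 60  # 1 EiB expressed in bytes

addressable_bytes = 2 ** BITS
size_in_eib = addressable_bytes // EXBIBYTE
print(size_in_eib)  # 16
```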

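What is the actual problem you are trying to solve? There may be a better way to achieve your goal.

Edit

Given that you need to store 10M+ strings, I suggest the approach described at https://serverfault.com/a/95454/98153

Use a well-defined hash algorithm (such as MD5) rather than Python's built-in hash function, whose results can vary by platform or implementation.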

>>> import hashlib
>>> hashlib.md5(b'test').hexdigest()
'098f6bcd4621d373cade4e832627b4f6'

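Then take 3 characters at a time to form a directory structure - three hex characters give at most 16 * 16 * 16 = 4096 entries per directory level. So for the example above, you would use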

/098/f6bcd4621d373cade4e832627b4f6.txt
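As a rough sketch, the path above could be derived like this. The function name sharded_path and its parameters root, prefix_len and ext are hypothetical, chosen only for this illustration; posixpath is used so the result has forward slashes on any platform.

```python
import hashlib
import posixpath


def sharded_path(text, root="/", prefix_len=3, ext=".txt"):
    """Map a string to <root>/<first hex chars>/<remaining hex chars><ext>."""
    # MD5 gives the same digest on every platform, unlike the built-in hash().
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return posixpath.join(root, digest[:prefix_len], digest[prefix_len:] + ext)


print(sharded_path("test"))  # /098/f6bcd4621d373cade4e832627b4f6.txt
```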

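Depending on the length of your strings, storing each string in its own file may be inefficient, because file systems allocate storage in blocks. So at this stage you could instead store one string per line in a file and search within that (very small) file, for example: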

/098/f6b.txt contains:
cd4621d373cade4e832627b4f6 test
02ab5595859014ebf0951522d9 another string
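The bucket-file layout above might be implemented roughly as follows. This is a minimal sketch, not a tuned implementation: the helper names store, lookup and _split are invented for the example, it assumes the strings contain no newline characters, and it does not guard against storing the same string twice.

```python
import hashlib
import os
import tempfile


def _split(text):
    """Return (directory name, file stem, remaining digest) for a string."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest[:3], digest[3:6], digest[6:]


def store(root, text):
    """Append '<remaining digest> <text>' to the bucket file for text."""
    top, stem, rest = _split(text)
    directory = os.path.join(root, top)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, stem + ".txt"), "a", encoding="utf-8") as f:
        f.write(f"{rest} {text}\n")


def lookup(root, text):
    """Return the stored string if present, otherwise None."""
    top, stem, rest = _split(text)
    path = os.path.join(root, top, stem + ".txt")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line is '<remaining digest> <text>'; split on the first space.
            key, _, value = line.rstrip("\n").partition(" ")
            if key == rest:
                return value
    return None


# Usage: a temporary directory stands in for the real storage root.
with tempfile.TemporaryDirectory() as root:
    store(root, "test")
    print(lookup(root, "test"))     # test
    print(lookup(root, "missing"))  # None
```

A linear scan is acceptable here only because each bucket file is expected to stay very small.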

You may need to tune the parameters for your specific application, but this seems like a good starting point.

Python: hashing a string to a position in a file
