Question

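I need to take some large files of strings and, in a separate file, replace each string with an ID starting from 1. There is some repetition of strings within each file, and there are strings common to several files, so these need to get the same ID. I implemented this with a dictionary, which works, but because of the size of the files and the number of strings this solution seems to run slowly. Is there a data structure or hashing technique better suited to this?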

______________________EDIT_______________________________________________________

My dict implementation

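index = {}      # maps each string to its integer ID
lastindex = 0   # highest ID handed out so far
for row in reader:
    if row[0] not in index:
        lastindex += 1
        index[row[0]] = lastindex
    # write() needs a string, so convert the ID and end the line
    w.write(f"{index[row[0]]}\n")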

Sample input

feifei77.w70-e2.ezcname.com
reseauocoz.cluster007.ovh.net
cse-web-cl.comunique-se.com.br
ext-cust.squarespace.com
ext-cust.squarespace.com
ext-cust.squarespace.com
ext-cust.squarespace.com
ghs.googlehosted.com
isutility.web9.hubspot.com
sendv54sxu8f12g.ihance.net
sites.smarsh.io
www.triblocal.com.s3-website-us-east-1.amazonaws.com
*.2bask.com
*.819.cn

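This should return

I should clarify that the output doesn't necessarily need to be ordered this way, although it does need to include every integer from 1 to the number of strings. 4 2 3 1 1 1 1 5 6 7 8 9 10 would also be valid.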

Answer 1

Using a set instead of a dict is a little friendlier on memory. Using the unique_everseen() recipe from the itertools documentation (https://docs.python.org/3/library/itertools.html), you could do the following:

# unique_everseen() is a recipe from the itertools docs, not a built-in; define it first
for idx, word in enumerate(unique_everseen(reader), 1):
    print(idx)
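
For reference, here is a minimal self-contained sketch of that loop, using a simplified copy of the unique_everseen() recipe (without its optional key argument). The `words` list is a hypothetical stand-in for `reader`. Note that the loop visits each distinct string only once, so repeated rows produce no output of their own; emitting an ID for every input row would still need a lookup from string to ID.

```python
from itertools import filterfalse


def unique_everseen(iterable):
    """Yield each distinct element once, in first-seen order.

    Simplified from the itertools documentation recipe (no `key` argument).
    """
    seen = set()
    # filterfalse skips elements already in `seen`; the set grows as we go.
    for element in filterfalse(seen.__contains__, iterable):
        seen.add(element)
        yield element


# Hypothetical sample data standing in for `reader`.
words = ["a.example", "b.example", "b.example", "c.example", "a.example"]

# Number the distinct strings starting at 1.
numbered = list(enumerate(unique_everseen(words), 1))
print(numbered)
```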

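An alternative that scales to larger datasets is to use some kind of persistent key/value store that keeps the data on disk (rather than an in-memory mapping), for example LevelDB (via Plyvel). It would look something like this: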

import itertools
import plyvel

db = plyvel.DB('my-database', create_if_missing=True)
cnt = itertools.count(1)  # start counting at 1
for word in reader:
    key = word.encode('utf-8')
    value = db.get(key)
    if value is not None:
        # We've seen this word before.
        idx = int(value)
    else:
        # We've not seen this word before.
        idx = next(cnt)
        db.put(key, str(idx).encode('ascii'))

    print(idx)
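
If installing Plyvel is not an option, the same idea can be tried with the standard library. The sketch below swaps LevelDB for Python's `dbm` module, which is a different on-disk key/value store, so its performance may differ considerably. The function name `assign_ids`, the sample list, and the database file name are hypothetical, and the counter assumes the database starts out empty.

```python
import dbm
import itertools
import os
import tempfile


def assign_ids(words, db_path):
    """Return one ID per input word, keeping the word -> ID map on disk."""
    ids = []
    cnt = itertools.count(1)  # assumes a fresh, empty database
    with dbm.open(db_path, "c") as db:  # "c": create the file if missing
        for word in words:
            key = word.encode("utf-8")
            if key in db:
                # Seen before: reuse the stored ID (values are bytes).
                idx = int(db[key])
            else:
                # New word: hand out the next ID and persist it.
                idx = next(cnt)
                db[key] = str(idx).encode("ascii")
            ids.append(idx)
    return ids


# Hypothetical sample data standing in for `reader`.
sample = ["a.example", "b.example", "b.example", "c.example", "a.example"]

with tempfile.TemporaryDirectory() as tmp:
    ids = assign_ids(sample, os.path.join(tmp, "string-ids"))
print(ids)
```

Because the map lives in a file, the same database path could be reused while processing several input files in one run, which would give shared strings the same ID; the counter would then have to be kept outside the function.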

Answer 2

The bottleneck in your code is the w.write call inside the for loop. Build the dict first and then write the file; that will run faster.
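
A minimal sketch of that two-phase approach, assuming each row is a sequence whose first element is the string (as in the question's `row[0]`). The names `assign_ids` and `rows` are hypothetical, and `io.StringIO` stands in for the output file `w`. Whether this is actually faster than writing inside the loop would need to be measured on the real data.

```python
import io


def assign_ids(rows):
    """Phase 1: build the string -> ID dict and collect one ID per row."""
    index = {}
    ids = []
    for row in rows:
        # setdefault stores len(index) + 1 only when the string is new,
        # so IDs run from 1 to the number of distinct strings.
        ids.append(index.setdefault(row[0], len(index) + 1))
    return ids


# Hypothetical sample rows standing in for `reader`.
rows = [["a.example"], ["b.example"], ["b.example"], ["c.example"]]
ids = assign_ids(rows)

# Phase 2: write all IDs with a single call instead of one call per row.
out = io.StringIO()  # stand-in for the real output file `w`
out.write("\n".join(map(str, ids)) + "\n")
print(out.getvalue(), end="")
```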

Faster way than dict to assign index to strings python
