I want to use a fairly large corpus. It's called web 1T-gram, and it has about 3 trillion tokens. This is my first time using Redis, and I'm trying to write all the key:value pairs, but it's taking far too long. My eventual goal is to spread the corpus across several Redis instances, but for now I'm stuck just writing it all to a single instance.
I'm not sure, but is there any way to speed up the writing process? At the moment I'm writing to a single Redis instance on a machine with 64 GB of RAM. I'm wondering whether there is some cache-size setting I could max out for Redis, or something along those lines?
Thanks.
For reference, I wrote the following code:
import gzip
import redis
import sys
import os
import time

r = redis.StrictRedis(host='localhost',port=6379,db=0)
startTime = time.time()
for l in os.listdir(sys.argv[1]):
    infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
    print l
    for line in infile:
        parts = line.split('\t')
        #print parts[0],' ',parts[1]
        r.set(parts[0],int(parts[1].rstrip('\n')))
r.bgsave()
print time.time() - startTime, ' seconds '
Update:
I read about mass insertion and have been trying to do that, but it keeps failing as well. Here are the changes to the script:
import sys
import os
import gzip

def gen_redis_proto(*args):
    proto = ''
    proto += '*' + str(len(args)) + '\r\n'
    for arg in args:
        proto += '$' + str(len(arg)) + '\r\n'
        proto += str(arg) + '\r\n'
    return proto

outputFile = open(sys.argv[2],'w')
for l in os.listdir(sys.argv[1]):
    infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
    for line in infile:
        parts = line.split('\t')
        key = parts[0]
        value = parts[1].rstrip('\n')
        #outputFile.write(gen_redis_proto('SET',key,value))
        print gen_redis_proto('SET',key,value)
    infile.close()
    print 'done with file ',l
Credit for the gen method goes to a GitHub user; I didn't write it.
If I run it, this is what I get:
ERR wrong number of arguments for 'set' command
ERR unknown command '$18'
ERR unknown command 'ESSPrivacyMark'
ERR unknown command '$3'
ERR unknown command '225'
ERR unknown command ' *3'
ERR unknown command '$3'
ERR wrong number of arguments for 'set' command
ERR unknown command '$25'
ERR unknown command 'ESSPrivacyMark'
ERR unknown command '$3'
ERR unknown command '157'
ERR unknown command ' *3'
ERR unknown command '$3'
And this goes on and on. The input format is
"string"\tcount
Thanks.
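(For reference, a likely culprit in the script above is that print appends its own newline after each command, which breaks the exact \r\n framing that redis-cli --pipe expects. Below is a minimal sketch of writing one correctly framed SET to the output file instead; the filename commands.txt is just a placeholder.)

import sys

def gen_redis_proto(*args):
    # Build one command in the Redis unified request protocol:
    # *<argc>\r\n, then $<len>\r\n<arg>\r\n for each argument.
    proto = '*' + str(len(args)) + '\r\n'
    for arg in args:
        arg = str(arg)
        proto += '$' + str(len(arg)) + '\r\n' + arg + '\r\n'
    return proto

out = open('commands.txt', 'wb')   # binary mode, and write(), not print
out.write(gen_redis_proto('SET', 'serve as the index', '223'))
out.close()

# The file then contains exactly:
# *3\r\n$3\r\nSET\r\n$18\r\nserve as the index\r\n$3\r\n223\r\n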
Second update:
I used pipelining, and that did give me a boost. But it soon ran out of memory. For reference, the system has 64 GB of RAM, and I didn't think it would run out of memory. The code is below:
import redis
import gzip
import os
import sys

r = redis.Redis(host='localhost',port=6379,db=0)
pipe = r.pipeline(transaction=False)
i = 0
MAX = 10000
ignore = ['3gm-0030.gz','3gm-0063.gz','2gm-0008.gz','3gm-0004.gz','3gm-0022.gz','2gm-0019.gz']
for l in os.listdir(sys.argv[1]):
    if l in ignore:
        continue
    infile = gzip.open(os.path.join(sys.argv[1],l),'rb')
    print 'doing it for file ',l
    for line in infile:
        parts = line.split('\t')
        key = parts[0]
        value = parts[1].rstrip('\n')
        if i < MAX:
            pipe.set(key,value)
            i = i + 1
        else:
            pipe.execute()   # flush the batch of MAX queued commands
            i = 0
            pipe.set(key,value)
            i = i + 1
    infile.close()
pipe.execute()   # flush the final partial batch
Is hashing the way to go? And I thought 64 gigs would be enough. I only gave it a small chunk of the 2 billion key:value pairs, not the whole thing.
Answer 0 (score: 2)
In your case you probably can't achieve what you want.
According to this page, your dataset is 24 GB gzip-compressed. These files likely contain a lot of similar text, much like a dictionary. A quick test with the words dict file gives a compression ratio of 3.12x:
> gzip -k -c /usr/share/dict/web2 > words.gz
> du /usr/share/dict/web2 words.gz
2496 /usr/share/dict/web2
800 words.gz
> calc '2496/800'
3.12 /* 3.12 */
> calc '3.12*24'
74.88 /* 7.488e1 */
So your uncompressed data will easily exceed 64 GB. Even without any Redis overhead, and even if you stored the counts as 16-bit unsigned integers, it would not fit in your RAM.
Looking at your sample, most keys are relatively short:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
You could hash the keys, but on average that probably won't save you much:
In [1]: from hashlib import md5
In [2]: data = '''serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52'''
In [3]: lines = data.splitlines()
In [4]: kv = [s.rsplit(None, 1) for s in lines]
In [5]: kv[0:2]
Out[5]: [['serve as the incoming', '92'], ['serve as the incubator', '99']]
In [6]: [len(s[0]) for s in kv]
Out[6]: [21, 22, 24, 18, 23, 22, 23, 26, 26, 23, 23]
In [7]: [len(md5(s[0]).digest()) for s in kv]
Out[7]: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
For any key shorter than 16 bytes, hashing it would actually cost more space, since an MD5 digest is always 16 bytes.
Compressing the strings usually won't save space either, even if you ignore the header:
In [1]: import zlib
In [2]: zlib.compress('foo')[:3]
Out[2]: 'x\x9cK'
In [3]: zlib.compress('bar')[:3]
Out[3]: 'x\x9cK'
In [4]: s = 'serve as the indispensable'
In [5]: len(s)
Out[5]: 26
In [6]: len(zlib.compress(s))-3
Out[6]: 31
Answer 1 (score: 0)
Instead of writing a command file, maybe you should use pipelining and multiprocessing. Pipelining in redis-py is quite simple to use; you will need to run tests to find the ideal chunk size.
For an example of redis-py, multiprocessing, and pipelining, take a look at this example gist.
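As a rough illustration of the approach (a minimal sketch, not the gist's code; the batch size of 10000 and the four worker processes are assumptions you would tune with your own tests):

import gzip
import os
import sys
from multiprocessing import Pool

import redis

CHUNK = 10000  # assumed batch size; tune experimentally

def load_file(path):
    # Each worker process opens its own connection and flushes
    # the pipeline in fixed-size batches.
    r = redis.Redis(host='localhost', port=6379, db=0)
    pipe = r.pipeline(transaction=False)
    n = 0
    for line in gzip.open(path, 'rb'):
        key, _, value = line.rstrip('\n').rpartition('\t')
        pipe.set(key, value)
        n += 1
        if n % CHUNK == 0:
            pipe.execute()
    pipe.execute()  # flush the final partial batch
    return path

if __name__ == '__main__':
    files = [os.path.join(sys.argv[1], f) for f in os.listdir(sys.argv[1])]
    pool = Pool(processes=4)  # assumed worker count
    for done in pool.imap_unordered(load_file, files):
        print 'done with', done

Since each worker keeps at most CHUNK commands queued, memory use stays bounded, unlike queueing an entire file on one pipeline.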
Answer 2 (score: 0)
I would definitely use hashes, since top-level keys carry overhead: each one stores additional data you may not need (a TTL, for example...).
The redis.io site also has some performance tricks, and Jeremy Zawodny stored 1.2 billion key/value pairs a while ago.
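A minimal sketch of that bucketing idea (the bucket count of 2**16 and the ngram: naming scheme are illustrative assumptions; the point is that many small hashes are stored very compactly as long as each stays under the server's hash-max-ziplist-entries threshold):

import zlib

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
BUCKETS = 2 ** 16  # assumed; sized so each hash stays small

def bucketed_set(key, value):
    # Spread the pairs across many small hashes instead of
    # using one top-level key per n-gram.
    bucket = 'ngram:%d' % (zlib.crc32(key) % BUCKETS)
    r.hset(bucket, key, value)

def bucketed_get(key):
    return r.hget('ngram:%d' % (zlib.crc32(key) % BUCKETS), key)

bucketed_set('serve as the index', 223)
print bucketed_get('serve as the index')  # -> '223'

This mirrors the small-hashes trick described on the redis.io memory optimization page.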