I am using shove to avoid loading a large dictionary into memory.
import string
import sys

from shove import Shove

# File-backed mapping: entries are written under ./storage
lemmaDict = Shove('file://storage')
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        # Fields are separated by ' ||| ': source ||| target ||| scores
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        lemmaDict[lineKey] = lineValue
However, while reading lemmaCPT I get the following KeyError and traceback. What is going on?
Traceback (most recent call last):
  File "./stemmer.py", line 19, in <module>
    lemmaDict[lineKey] = lineValue
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 44, in __setitem__
    self.sync()
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/core.py", line 74, in sync
    self._store.update(self._buffer)
  File "/opt/Python-2.7.6/lib/python2.7/_abcoll.py", line 542, in update
    self[key] = other[key]
  File "/opt/Python-2.7.6/lib/python2.7/site-packages/shove/base.py", line 123, in __setitem__
    raise KeyError(key)
KeyError: '! ! ! \xd1\x87\xd0\xb8\xd1\x82\xd0\xb0\xd0\xb5\xd1\x82\xd1\x81\xd1\x8f \xd1\x82\xd1\x80\xd0\xbe\xd0\xb5\xd0\xba\xd1\x80\xd0\xb0\xd1\x82\xd0\xbd\xd1\x8b\xd0\xbc \xd0\xbf\xd0\xbe\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5\xd0\xbc \xd0\xbb\xd1\x8e\xd0\xb1\xd0\xbe\xd0\xb3\xd0\xbe ||| ! ! ! is pronounced by'
Sample input:
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is pronounced ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.0128612
! ! ! читается троекратным повторением ||| ! ! ! ||| 0.000119622 8.53148e-39 0.0098932 0.590703
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced by ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is pronounced ||| 0.00819374 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением ||| , ! ! ! is ||| 0.00819374 8.53148e-39 0.00989281 0.00154241
! ! ! читается троекратным повторением ||| , ! ! ! ||| 0.0074488 8.53148e-39 0.00989281 0.070842
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by repeating ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by ||| 0.00744887 8.53148e-39 0.00989281 8.53148e-39
Running code.py sampleinput produces the KeyError and traceback above.
Answer 0 (score: 2)
If this is the actual input, then the problem is the length of lemmaDict versus the length of the input ...
aftnix@dev:~⟫ cat input | wc -l
11
I changed the code ...
from shove import Shove
import sys
import string
lemmaDict = Shove('file://storage')
i = 0
with open(str(sys.argv[1])) as lemmaCPT:
    for line in lemmaCPT:
        line = line.rstrip('\n')
        lineAr = string.split(line, ' ||| ')
        lineKey = lineAr[0] + ' ||| ' + lineAr[1]
        lineValue = lineAr[2]
        print lineValue
        print len(lemmaDict)
        #print len(lemmaCPT)
        i+=1
        print i
        #lemmaDict[lineKey] = lineValue
which gives the following output ...
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
1
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
2
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
3
0.00819374 8.53148e-39 0.00989281 0.0128612
9
4
0.000119622 8.53148e-39 0.0098932 0.590703
9
5
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
6
0.00819374 8.53148e-39 0.00989281 8.53148e-39
9
7
0.00819374 8.53148e-39 0.00989281 0.00154241
9
8
0.0074488 8.53148e-39 0.00989281 0.070842
9
9
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
10
0.00744887 8.53148e-39 0.00989281 8.53148e-39
9
So you are simply overrunning the dict.
If you remove two lines from the input, it stops throwing the exception.
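I don't know shove, but a quick check in the shell tells me it always returns a dict with the same number of keys (9 in the output above). There must be a way to grow it... maybe there is a method or something like that... you should dig through its documentation more carefully.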
I just think you are using Shove the wrong way.
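Edit: This is a bit odd... After looking at the Shove code, it turns out that it is supposed to sync its in-memory contents once the buffer limit is reached...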
def __setitem__(self, key, value):
    self._cache[key] = self._buffer[key] = value
    # when buffer reaches self._limit, write buffer to store
    if len(self._buffer) >= self._sync:
        self.sync()
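The deferred write is why the traceback passes through sync() and update() before reaching the store. As a rough illustration only, here is a toy write-buffered mapping in Python 3. The class name BufferedStore and the sync=2 threshold are hypothetical and this is not shove's actual implementation.

```python
class BufferedStore(object):
    """Toy write-behind mapping (hypothetical, not shove's real class)."""

    def __init__(self, store, sync=2):
        self._store = store    # backing mapping, e.g. a file-based store
        self._buffer = {}      # writes that have not been flushed yet
        self._sync = sync      # flush once this many writes are pending

    def __setitem__(self, key, value):
        self._buffer[key] = value
        # Flush only when the buffer is full, so an error raised by the
        # backing store can surface on a later, unrelated assignment.
        if len(self._buffer) >= self._sync:
            self.sync()

    def sync(self):
        self._store.update(self._buffer)
        self._buffer.clear()


backing = {}
buffered = BufferedStore(backing, sync=2)
buffered["first"] = 1
print(len(backing))   # nothing flushed yet
buffered["second"] = 2
print(len(backing))   # both entries flushed together
```

With this kind of buffering, the key named in the KeyError is not necessarily the one being assigned when the exception appears.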
Edit 2
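I was completely wrong earlier... but I have some interesting pointers. One problem is that shove raises a confusing exception...
The real exception happens because of this...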
def __setitem__(self, key, value):
    # (per Larry Meyn)
    try:
        with open(self._key_to_file(key), 'wb') as item:
            item.write(self.dumps(value))
    except (IOError, OSError):
        raise KeyError(key)
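So the exception actually comes from the open system call, which means shove had trouble writing the file; the KeyError only hides the underlying IOError/OSError. That made me suspect the length of the string...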
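To see the hidden error directly, one can try to create a file with an overlong name outside of shove. The following is a minimal sketch for Python 3 (where IOError is an alias of OSError); try_write is a hypothetical helper written for this illustration, not part of shove.

```python
import errno
import os
import tempfile


def try_write(directory, filename, data=b"x"):
    """Return None on success, or the OSError raised while writing."""
    try:
        with open(os.path.join(directory, filename), "wb") as item:
            item.write(data)
    except OSError as exc:
        return exc
    return None


workdir = tempfile.mkdtemp()
for name in ("short", "a" * 300):
    error = try_write(workdir, name)
    if error is None:
        print(len(name), "ok")
    else:
        # On Linux the code is typically ENAMETOOLONG for names this long.
        print(len(name), errno.errorcode.get(error.errno, error.errno))
```

Catching the OSError and printing its errno like this shows the actual cause, which the re-raised KeyError does not carry.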
This is what the storage folder looks like...
aftnix@dev:~⟫ ls -l storage/
total 36
-rw-rw-r-- 1 aftnix aftnix 49 ডিসে 4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%21+%21+%21
-rw-rw-r-- 1 aftnix aftnix 52 ডিসে 4 01:35 %21+%21+%21+%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D1%82%D1%81%D1%8F+%D1%82%D1%80%D0%BE%D0%B5%D0%BA%D1%80%D0%B0%D1%82%D0%BD%D1%8B%D0%BC+%D0%BF%D0%BE%D0%B2%D1%82%D0%BE%D1%80%D0%B5%D0%BD%D0%B8%D0%B5%D0%BC+%7C%7C%7C+%2C+%21+%21+%21+is+pronounced
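So shove uses the key as the file name. That can get ugly, because the strings in your last two entries are very long, especially the second to last one. As a test, I removed some characters from the last two lines of the input, and the code ran as expected without any exception.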
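The file names in the listing look like URL-quoted keys with + for spaces. Assuming that encoding (approximated here with quote_plus from the Python 3 standard library, which may differ from what shove does internally), a short sketch can check whether a key would exceed the usual 255-byte Linux file name limit. The helper names encoded_name and fits are made up for this example.

```python
from urllib.parse import quote_plus

NAME_MAX = 255  # usual Linux limit on the length of one file name


def encoded_name(key):
    """Approximate the on-disk file name for a key (assumed encoding)."""
    return quote_plus(key)


def fits(key):
    """True if the encoded key is short enough to be a file name."""
    return len(encoded_name(key).encode("ascii")) <= NAME_MAX


# Keys built from the sample input: the first one was stored, the
# second one is the key reported in the KeyError.
working = "! ! ! читается троекратным повторением ||| ! ! ! is pronounced by"
failing = "! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by"

for key in (working, failing):
    print(len(encoded_name(key)), fits(key))
```

Each Cyrillic letter takes two bytes in UTF-8 and each byte becomes three characters once quoted, so a key that looks short on screen can still exceed the limit.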
The Linux kernel's limit on file name length is ...
aftnix@dev:~⟫ cat /usr/include/linux/limits.h
#ifndef _LINUX_LIMITS_H
#define _LINUX_LIMITS_H
#define NR_OPEN 1024
#define NGROUPS_MAX 65536 /* supplemental group IDs are available */
#define ARG_MAX 131072 /* # bytes of args + environ for exec() */
#define LINK_MAX 127 /* # links a file may have */
#define MAX_CANON 255 /* size of the canonical input queue */
#define MAX_INPUT 255 /* size of the type-ahead buffer */
#define NAME_MAX 255 /* # chars in a file name */
So to solve this you have to do something else. You cannot put the plain parsed keys into Shove.
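One possible workaround, offered only as a sketch and not tested against shove itself, is to store each entry under a fixed-length hash of the key and keep the original key next to the value. The helpers short_key, put and get are hypothetical, and a plain dict stands in for the Shove object here; the code targets Python 3.

```python
import hashlib


def short_key(key):
    """Map an arbitrarily long key to a 40-character hex digest."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def put(store, key, value):
    # Keep the original key with the value so it can be recovered later.
    store[short_key(key)] = (key, value)


def get(store, key):
    original, value = store[short_key(key)]
    if original != key:  # guards against an (unlikely) hash collision
        raise KeyError(key)
    return value


store = {}  # stand-in for Shove('file://storage'); an assumption
long_key = "! ! ! читается троекратным повторением любого ||| ! ! ! is pronounced by"
put(store, long_key, "0.00744887 8.53148e-39 0.00989281 8.53148e-39")
print(get(store, long_key))
print(max(len(k) for k in store))
```

The trade-off is that the stored names are no longer human-readable, and iterating over the original keys means reading them back from the stored values.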