Question

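I have a large body of text on which I want to run several algorithms. These algorithms don't care what the words are; to them a word is just a unique object. So I would like to reduce the size of the text by simply replacing the words with integer IDs.

An example: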

my_string = "an example sentence with an example repetition."
my_ids = get_ids_from_string(my_string)
print(my_ids)
# [0, 1, 2, 3, 0, 1, 4]  (note that the ID for 'example' is always the same)

I am looking for a neat, efficient, Pythonic way to solve this.

Answer 1

You don't gain much by replacing strings with integers; you get the same benefit by making sure that identical strings are stored only once in memory.

import sys  # Python 3; in Python 2, intern() is a builtin and needs no import

my_string = "an example sentence with an example repetition."
words = my_string.split()
unique_words = [sys.intern(word) for word in words]

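The "unique_words" list is equal to the "words" list, but intern() guarantees that each distinct string is stored only once. If you do this on a large text corpus with a much smaller set of possible words, it will not use substantially more memory than integers would.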
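A small sketch, assuming Python 3 (where the function lives in the sys module as sys.intern), of how one might check that interning really collapses repeated words into a single object. The names same_object and distinct_objects are only illustrative.

```python
import sys

my_string = "an example sentence with an example repetition."
words = my_string.split()

# sys.intern returns the canonical copy of each string, so equal words
# end up being the very same object in memory.
unique_words = [sys.intern(word) for word in words]

# Both occurrences of "an" (positions 0 and 4) now share one object.
same_object = unique_words[0] is unique_words[4]

# Count how many distinct string objects the list refers to.
distinct_objects = len({id(w) for w in unique_words})

print(same_object, distinct_objects)  # True 5
```

The list still holds seven references, but they point at only five string objects, which is where the memory saving comes from.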

Answer 2

The answer I came up with is the following (you can reuse the next_i function for multiple ID generators):

from collections import defaultdict

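# One counter per counter_id; each starts at -1 so the first ID handed out is 0.
COUNTER = defaultdict(lambda : -1)
def next_i(counter_id=0):
    """Return a new ID"""
    COUNTER[counter_id] += 1
    return COUNTER[counter_id]

# Unseen words trigger the default factory and receive the next free ID.
id_generator = defaultdict(lambda : next_i(0))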

my_string = "an example sentence with an example repetition."
my_ids = [id_generator[t] for t in my_string.lower().split()]
print(my_ids)
# [0, 1, 2, 3, 0, 1, 4]

On a set of 6 million documents, this algorithm finishes in 45.56s.
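The same idea can probably be written without a module-level counter. The following is only a sketch of one possible variant, assuming Python 3; make_id_generator is a hypothetical helper name, not something from the answer above, and no timing claim is made for it.

```python
from collections import defaultdict
from itertools import count


def make_id_generator():
    """Return a dict-like object that assigns the next free ID to unseen keys."""
    counter = count()  # yields 0, 1, 2, ...
    # The default factory runs only when a key is missing, so every new
    # word consumes exactly one value from the counter.
    return defaultdict(lambda: next(counter))


id_generator = make_id_generator()
my_string = "an example sentence with an example repetition."
my_ids = [id_generator[t] for t in my_string.lower().split()]
print(my_ids)  # [0, 1, 2, 3, 0, 1, 4]
```

Each call to make_id_generator() gets its own counter, so several independent ID spaces can coexist without sharing global state.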

Answer 3

I would use the fromkeys method of collections.OrderedDict:

>>> from collections import OrderedDict
>>> my_string = "an example sentence with an example repetition."
>>> words = my_string.split()
>>> code = {w: i for i,w in enumerate(OrderedDict.fromkeys(words))}
>>> [code[w] for w in words]
[0, 1, 2, 3, 0, 1, 4]

This works because OrderedDict.fromkeys builds a dictionary containing the unique words in the order in which they first appear:

>>> OrderedDict.fromkeys(words)
OrderedDict([('an', None), ('example', None), ('sentence', None), ('with', None), ('repetition.', None)])

This approach works, but it apparently does not perform very well.
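For readers on Python 3.7 or later, where plain dicts preserve insertion order, the same first-seen ordering can be sketched with dict.fromkeys instead of OrderedDict. This is an assumption about the reader's Python version, and encode_words is a hypothetical helper name used only for illustration.

```python
def encode_words(words):
    """Map each word to the index of its first appearance."""
    # dict.fromkeys keeps one key per unique word, in first-seen order
    # (guaranteed for plain dicts from Python 3.7 onward).
    code = {w: i for i, w in enumerate(dict.fromkeys(words))}
    return [code[w] for w in words], code


words = "an example sentence with an example repetition.".split()
ids, code = encode_words(words)
print(ids)  # [0, 1, 2, 3, 0, 1, 4]
```

Returning the code mapping alongside the IDs makes it possible to encode further text with the same vocabulary.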

Answer 4

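We have a corpus (about a million documents) from which we produce recommendations for my company, so reducing its size is very important to us. The biggest space savings we got came from three simple things, two of which use nltk:

Use the default punkt sentence tokenization to reduce the sentences to more meaningful symbols.
Remove all stopwords, since they don't carry any useful information.
Use the nltk porter stemmer.

This is not the most processor-efficient approach, but it will be quite efficient in terms of space if that is your bottleneck. If the sentence tokenizer is too slow (I think it is the slowest part), it would probably still be worth doing the stemming and removing the stopwords.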
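To illustrate why the last two steps shrink the vocabulary, here is a dependency-free sketch. It deliberately does not use nltk: STOPWORDS is a tiny made-up list and toy_stem is a crude stand-in, not the Porter algorithm. In a real pipeline one would presumably pass nltk.corpus.stopwords.words("english") and nltk.stem.PorterStemmer().stem instead.

```python
# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "with", "of", "and"}


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def reduce_tokens(tokens, stopwords=STOPWORDS, stem=toy_stem):
    """Drop stopwords, then stem whatever is left."""
    return [stem(t) for t in tokens if t not in stopwords]


tokens = "the examples with repeated words and repeating examples".split()
reduced = reduce_tokens(tokens)

vocab_before = len(set(tokens))   # 7 distinct tokens
vocab_after = len(set(reduced))   # 3 distinct stems
print(reduced)  # ['example', 'repeat', 'word', 'repeat', 'example']
```

Fewer tokens and a smaller vocabulary both reduce the space needed once the remaining words are replaced by integer IDs.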

Simple [pythonic] way to create ids for a given set of words

4 answers: