Question

I am working with a very large list of about 56,000 elements (all strings), and I am trying to reduce the running time.

Is there a way to speed up this line: x = [int(i in list2) for i in list1]

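Given a dictionary of words (list1) and a sentence (list2), I am trying to build a binary representation of the sentence, such as [1,0,0,0,0,0,1 ........ 0], where a 1 in position i means that the i-th word of the dictionary appears in the sentence.

What is the fastest way to do this?

Example data: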

dictionary =  ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
sentence = ['I', 'like', 'to', 'eat', 'apples']
result = [0,0,1,1,1,0,0,1,0,0]

Answer 1

set2 = set(list2)
x = [int(i in set2) for i in list1]

Answer 2

Use sets; the total time complexity is O(N):

>>> sentence = ['I', 'like', 'to', 'eat', 'apples']
>>> dictionary =  ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
>>> s= set(sentence)
>>> [int(word in s) for word in dictionary]
[0, 0, 1, 1, 1, 0, 0, 1, 0, 0]

If your list of sentences contains actual sentences rather than single words, try this:

>>> sentences= ["foobar foo", "spam eggs" ,"monty python"]
>>> words=["foo", "oof", "bar", "pyth" ,"spam"]
>>> from itertools import chain

# fetch words from each sentence and create a flattened set of all words
>>> s = set(chain(*(x.split() for x in sentences)))

>>> [int(x in s) for x in words]
[1, 0, 0, 0, 1]
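
The snippet above merges all sentences into one set. If a separate vector per sentence is wanted instead, a minimal sketch could look like the following. The function name `encode_sentences` is hypothetical, and it assumes whitespace tokenization with `str.split()`, as in the snippet above.

```python
def encode_sentences(words, sentences):
    """Return one 0/1 vector per sentence, aligned with `words`."""
    vectors = []
    for sentence in sentences:
        tokens = set(sentence.split())  # one set per sentence, built once
        vectors.append([int(w in tokens) for w in words])
    return vectors


sentences = ["foobar foo", "spam eggs", "monty python"]
words = ["foo", "oof", "bar", "pyth", "spam"]
vectors = encode_sentences(words, sentences)
print(vectors)  # [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]
```

Each sentence costs one set construction plus one pass over `words`, so the lookups themselves stay constant-time on average.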

Answer 3

I would suggest something like this:

words = set(['hello', 'there'])  # have the words available as a set
sentance = ['hello', 'monkey', 'theres', 'there']
rep = [1 if w in words else 0 for w in sentance]
print(rep)  # [1, 0, 0, 1]

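I would take this approach because sets have O(1) average lookup time, so checking whether w is in words takes constant time. That makes the list comprehension O(n), since it has to visit each word once. I believe this is close to as efficient as it gets.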
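
To see the effect of the lookup cost in practice, here is a rough timing sketch. The data is synthetic and only an assumption (56,000 made-up strings as the dictionary, a 56-word sentence drawn from it), the roles follow the question (iterate over the dictionary, look up in the sentence), and the helper names `with_list` and `with_set` are hypothetical. Actual timings depend on the machine, so none are shown.

```python
import timeit

# Synthetic data (an assumption): 56,000 distinct strings as the dictionary,
# and a sentence made of every 1000th dictionary entry (56 words).
dictionary = [f"word{i}" for i in range(56_000)]
sentence = dictionary[::1000]


def with_list():
    # each `in` scans the list, so one lookup costs O(len(sentence))
    return [int(w in sentence) for w in dictionary]


def with_set():
    lookup = set(sentence)  # built once, O(len(sentence))
    # each `in` is a hash lookup, constant time on average
    return [int(w in lookup) for w in dictionary]


# Both variants must produce the same vector.
assert with_list() == with_set()

print("list lookup:", timeit.timeit(with_list, number=1))
print("set lookup: ", timeit.timeit(with_set, number=1))
```

The gap between the two grows with the length of the sentence, because the list version rescans it for every dictionary word.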

You also mentioned creating a 'boolean' array, which would let you simplify this to:

rep = [w in words for w in sentance]
print(rep)  # [True, False, False, True]
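
Because Python's `bool` is a subclass of `int`, the boolean list can still be used numerically. A small sketch, reusing the names from the snippet above (the variables `as_ints` and `matches` are introduced here only for illustration):

```python
words = {'hello', 'there'}
sentance = ['hello', 'monkey', 'theres', 'there']

rep = [w in words for w in sentance]  # list of bools
as_ints = [int(b) for b in rep]       # convert only if 0/1 is really needed
matches = sum(rep)                    # True counts as 1 when summed

print(as_ints)  # [1, 0, 0, 1]
print(matches)  # 2
```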

Quickly create a boolean array based on whether the i-th element is contained in another list
