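I'm working with a very large list of about 56,000 elements (all strings), and I'm trying to reduce the running time.
Is there a way to speed up this line: x = [int(i in list2) for i in list1]
Given a dictionary of words (list1) and a sentence (list2), I'm trying to build a binary representation of the sentence, such as [1,0,0,0,0,0,1 ........ 0], where a 1 in position i means the i-th word of the dictionary appears in the sentence.
What is the fastest way to do this?
Sample data: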
dictionary = ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
sentence = ['I', 'like', 'to', 'eat', 'apples']
result = [0,0,1,1,1,0,0,1,0,0]
Answer 0: (score: 1)
set2 = set(list2)
x = [int(i in set2) for i in list1]
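For reference, here is a minimal sketch of the same idea wrapped in a function and run on the sample data from the question. The function name binary_vector is hypothetical and chosen only for illustration; the technique is the set conversion shown above.

```python
def binary_vector(dictionary, sentence):
    """Return a list of 1/0 flags, one per dictionary word.

    A flag is 1 when that exact word occurs in the sentence.
    (binary_vector is a hypothetical helper name, not from the thread.)
    """
    present = set(sentence)  # built once, so each lookup below is O(1) on average
    return [int(word in present) for word in dictionary]


dictionary = ['aardvark', 'apple', 'eat', 'I', 'like', 'maize', 'man', 'to', 'zebra', 'zed']
sentence = ['I', 'like', 'to', 'eat', 'apples']

print(binary_vector(dictionary, sentence))
# prints [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
```

Note that matching is exact: 'apples' in the sentence does not set the flag for 'apple'.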
Answer 1: (score: 0)
Use sets, for a total time complexity of O(N):
>>> sentence = ['I', 'like', 'to', 'eat', 'apples']
>>> dictionary = ['aardvark', 'apple','eat','I','like','maize','man','to','zebra', 'zed']
>>> s= set(sentence)
>>> [int(word in s) for word in dictionary]
[0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
If your sentence list contains actual sentences rather than individual words, try the following:
>>> sentences= ["foobar foo", "spam eggs" ,"monty python"]
>>> words=["foo", "oof", "bar", "pyth" ,"spam"]
>>> from itertools import chain
# fetch words from each sentence and create a flattened set of all words
>>> s = set(chain(*(x.split() for x in sentences)))
>>> [int(x in s) for x in words]
[1, 0, 0, 0, 1]
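The snippet above merges every sentence into one set. If a separate vector per sentence is wanted instead, which is an assumption about the goal and not something the question states outright, a sketch could look like this. The helper name encode_each is hypothetical.

```python
def encode_each(words, sentences):
    """Build one 1/0 vector per sentence (hypothetical helper for illustration)."""
    rows = []
    for sentence in sentences:
        tokens = set(sentence.split())  # whitespace tokenization, exact matches only
        rows.append([int(w in tokens) for w in words])
    return rows


sentences = ["foobar foo", "spam eggs", "monty python"]
words = ["foo", "oof", "bar", "pyth", "spam"]

rows = encode_each(words, sentences)
for row in rows:
    print(row)
# prints:
# [1, 0, 0, 0, 0]
# [0, 0, 0, 0, 1]
# [0, 0, 0, 0, 0]
```

As in the merged version, "bar" and "pyth" are not counted, because the lookup tests whole tokens and not substrings.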
Answer 2: (score: 0)
I would suggest something like this:
words = set(['hello','there']) #have the words available as a set
sentance = ['hello','monkey','theres','there']
rep = [ 1 if w in words else 0 for w in sentance ]
>>>
[1, 0, 0, 1]
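I would take this approach because sets have O(1) average lookup time, so checking whether w is in words takes constant time. That makes the list comprehension O(n), since it has to visit each word exactly once. I believe this is at or close to the most efficient you can get.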
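To see the effect of the constant-time lookup, here is a rough timing sketch. The data is randomly generated and the sizes (5,000 words to check, 500 words to look up in) are assumptions chosen so the slow version finishes quickly; actual timings depend on the machine, so none are shown.

```python
import random
import string
import timeit

random.seed(0)  # fixed seed so the generated data is reproducible


def random_word(length=8):
    """Generate a random lowercase word (made-up test data)."""
    return ''.join(random.choices(string.ascii_lowercase, k=length))


candidates = [random_word() for _ in range(5_000)]  # words to test
lookup_list = random.sample(candidates, 500)        # collection searched as a list
lookup_set = set(lookup_list)                       # same collection as a set


def with_list():
    # each "in" scans the list: O(len(lookup_list)) per test
    return [1 if w in lookup_list else 0 for w in candidates]


def with_set():
    # each "in" is a hash lookup: O(1) on average per test
    return [1 if w in lookup_set else 0 for w in candidates]


list_result = with_list()
set_result = with_set()

print("list:", timeit.timeit(with_list, number=3))
print("set: ", timeit.timeit(with_set, number=3))
```

Both functions return the same list; only the cost of each membership test differs, and the gap grows with the size of the collection being searched.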
You also mentioned creating a 'boolean' array, which would let you simplify this to:
rep = [ w in words for w in sentance ]
>>>
[True, False, False, True]