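I wrote the code below to build a vocabulary. q and a are lists of strings, and s is a list of lists of strings. It works, but it is very slow because data is very large, and I don't think the code is particularly smart.
Is there a more elegant way to implement this?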
vocab = functools.reduce(lambda x, y: x | y, (set(list(chain.from_iterable(s)) + q + a) for s, q, a in data))
I wrote a simple program to test this code, shown below. In the actual dataset, data is very long.
import time
import functools
from itertools import chain
s1 = [
    ['a', 'b', 'fwa'],  # actual length is around 10
    ['foo', 'ixb', 'fwa'],
    ['fj', 'fab', 'fwa']
]
q1 = ['fwa', 'fawh'] # actual length is around 10
a1 = ['fjj', 'jfaw'] # actual length is around 3
data = []
for i in range(10000000):
    data.append((s1, q1, a1))
start = time.time()
vocab = functools.reduce(lambda x, y: x | y, (set(list(chain.from_iterable(s)) + q + a) for s, q, a in data)) # my way
elapsed_time = time.time() - start
print(elapsed_time) # 11.522738695144653
print(vocab)
start = time.time()
vocab = functools.reduce(lambda x, y: x | y, (set(chain(chain.from_iterable(s), q, a)) for s, q, a in data)) # @cowbert
elapsed_time = time.time() - start
print(elapsed_time) # 9.918306350708008
print(vocab)
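One possible alternative, as a minimal sketch: both versions above build a temporary set for every record and then create a new set on each x | y step of the reduce. Updating a single set in place avoids those intermediate objects. This assumes data is an iterable of (s, q, a) tuples shaped as in the test program. build_vocab is a hypothetical name introduced only for illustration, and no timing was measured for it, so whether it is faster on the real dataset would need to be checked.

```python
def build_vocab(data):
    """Collect every distinct token from an iterable of (s, q, a) records.

    Assumed shapes (taken from the test program above):
      s is a list of lists of strings, q and a are lists of strings.
    """
    vocab = set()
    for s, q, a in data:
        # set.update accepts any number of iterables, so each inner
        # list of s is unpacked and passed alongside q and a.
        vocab.update(q, a, *s)
    return vocab


if __name__ == "__main__":
    s1 = [['a', 'b', 'fwa'], ['foo', 'ixb', 'fwa'], ['fj', 'fab', 'fwa']]
    q1 = ['fwa', 'fawh']
    a1 = ['fjj', 'jfaw']
    sample = [(s1, q1, a1)] * 1000  # small stand-in for the real data
    print(sorted(build_vocab(sample)))
```

The result is the same set that the reduce-based versions produce. Only the way it is accumulated differs.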