Question

I have a list of tuples containing strings, for example:

[('this', 'is', 'a', 'foo', 'bar', 'sentences'),
 ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
 ('a', 'foo', 'bar', 'sentences', 'and', 'i'),
 ('foo', 'bar', 'sentences', 'and', 'i', 'want'),
 ('bar', 'sentences', 'and', 'i', 'want', 'to'),
 ('sentences', 'and', 'i', 'want', 'to', 'ngramize'),
 ('and', 'i', 'want', 'to', 'ngramize', 'it')]

Now I want to join the strings in each tuple to create a list of space-separated strings. I used the following approach:

NewData=[]
for grams in sixgrams:
    NewData.append((''.join([w + ' ' for w in grams])).strip())

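It works fine.

However, the list I have contains over a million tuples. So my question is: is this approach efficient enough, or is there a better way to do it? Thanks.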

Answer 1

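For a large amount of data, you should consider whether you need to keep it all in a list. If you process the items one at a time, you can create a generator that yields each joined string without keeping them all in memory: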

new_data = (' '.join(w) for w in sixgrams)

If you can also get the original tuples from a generator, you can avoid holding the sixgrams list in memory as well.
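
To make that last point concrete, here is a minimal sketch of a fully lazy pipeline. The `ngrams` helper and the `tokens` sample are hypothetical names introduced for illustration; the sketch assumes the tuples come from a sliding window over a sequence of words, which the question does not actually state.

```python
from collections import deque
from itertools import islice


def ngrams(tokens, n):
    """Lazily yield each run of n consecutive tokens as a tuple (hypothetical helper)."""
    it = iter(tokens)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield tuple(window)
    for token in it:
        window.append(token)  # maxlen drops the oldest token automatically
        yield tuple(window)


tokens = "this is a foo bar sentences and i want to ngramize it".split()

# Neither the tuples nor the joined strings are ever stored in a list.
joined = (' '.join(gram) for gram in ngrams(tokens, 6))

first = next(joined)  # strings are produced one at a time, on demand
rest = list(joined)   # materialized here only to show the remaining items
```

In real use you would iterate over `joined` directly (for example, writing each string to a file) instead of calling `list()` on it.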

Answer 2

The list comprehension in your code creates a temporary string for every word (w + ' '). Just use ' '.join instead.

>>> words_list = [('this', 'is', 'a', 'foo', 'bar', 'sentences'),
...               ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
...               ('a', 'foo', 'bar', 'sentences', 'and', 'i'),
...               ('foo', 'bar', 'sentences', 'and', 'i', 'want'),
...               ('bar', 'sentences', 'and', 'i', 'want', 'to'),
...               ('sentences', 'and', 'i', 'want', 'to', 'ngramize'),
...               ('and', 'i', 'want', 'to', 'ngramize', 'it')]
>>> new_list = []
>>> for words in words_list:
...     new_list.append(' '.join(words)) # <---------------
... 
>>> new_list
['this is a foo bar sentences', 
 'is a foo bar sentences and', 
 'a foo bar sentences and i', 
 'foo bar sentences and i want', 
 'bar sentences and i want to', 
 'sentences and i want to ngramize', 
 'and i want to ngramize it']

The for loop above can be written as the following list comprehension:

new_list = [' '.join(words) for words in words_list]
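
As a quick sanity check, the sketch below compares the approach from the question with the plain `' '.join` version. The sample is a shortened copy of the list in the question, and the names `original` and `simple` are chosen here for illustration.

```python
sixgrams = [
    ('this', 'is', 'a', 'foo', 'bar', 'sentences'),
    ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
]

# Approach from the question: append a space to every word, join, then strip.
original = [(''.join([w + ' ' for w in grams])).strip() for grams in sixgrams]

# Simpler approach: let str.join insert the separator between the words.
simple = [' '.join(grams) for grams in sixgrams]

assert original == simple
```

The two results can differ only when the first word of a tuple starts with whitespace or the last word ends with it, because `strip()` would remove that whitespace as well.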

Answer 3

You can do this efficiently like this:

joiner = " ".join
print map(joiner, sixgrams)

We can improve performance further with a list comprehension like this:

joiner = " ".join
print [joiner(words) for words in sixgrams]
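
The two snippets above use Python 2 syntax. Under Python 3, `print` is a function and `map` returns a lazy iterator, so a rough equivalent might look like the sketch below. The short `sixgrams` sample and the names `mapped` and `comprehended` are assumptions added for illustration.

```python
sixgrams = [
    ('this', 'is', 'a', 'foo', 'bar', 'sentences'),
    ('is', 'a', 'foo', 'bar', 'sentences', 'and'),
]

joiner = " ".join  # bind the method once instead of looking it up per item

# map() is lazy in Python 3, so wrap it in list() to get the actual strings.
mapped = list(map(joiner, sixgrams))
print(mapped)

comprehended = [joiner(words) for words in sixgrams]
print(comprehended)
```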

A performance comparison shows that the list comprehension solution above is slightly faster than the other two solutions.

from timeit import timeit

joiner = " ".join

def mapSolution():
    return map(joiner, sixgrams)

def comprehensionSolution1():
    return [" ".join(words) for words in sixgrams]

def comprehensionSolution2():
    return [joiner(words) for words in sixgrams]

print timeit("mapSolution()", "from __main__ import joiner, mapSolution, sixgrams")
print timeit("comprehensionSolution1()", "from __main__ import sixgrams, comprehensionSolution1, joiner")
print timeit("comprehensionSolution2()", "from __main__ import sixgrams, comprehensionSolution2, joiner")

Output on my machine:

1.5691678524
1.66710209846
1.47555398941

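The performance gain is most likely because we don't have to create the join function from the string literal on every iteration.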
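
For readers who want to repeat the measurement on Python 3, here is a hedged, self-contained sketch. The sample data, the function names, and the `number=200` repeat count are assumptions chosen so that the script finishes quickly; the timings will vary by machine and Python version and need not match the figures above.

```python
from timeit import timeit

# Assumed sample data: 1,000 copies of one six-word tuple.
sixgrams = [('this', 'is', 'a', 'foo', 'bar', 'sentences')] * 1000
joiner = " ".join


def map_solution():
    return list(map(joiner, sixgrams))  # list() forces the lazy map in Python 3


def literal_join():
    return [" ".join(words) for words in sixgrams]  # looks up join every iteration


def cached_join():
    return [joiner(words) for words in sixgrams]  # reuses the bound method


# All three must produce the same result for the comparison to be fair.
assert map_solution() == literal_join() == cached_join()

# timeit accepts a callable; number is lowered from its default of 1,000,000.
timings = {
    func.__name__: timeit(func, number=200)
    for func in (map_solution, literal_join, cached_join)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```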

Edit: Although we can improve performance like this, the most Pythonic way is to use a generator, as in lvc's answer.

Joining the elements of tuples in a list in Python
