I want to convert this Series into a single flat list containing every word from every review. For example:
s = [['the','pizza','was','okay'...],['i','love','this','place','my','fiance','and','i','go'...]]
Expected output:
s = ['the','pizza','was','okay'...,'i','love','this','place','my','fiance','and','i','go'...]
I tried using tolist() together with some loops, but I must be missing something. What is a good way to solve this?
Answer 0 (score: 2)
Use a nested list comprehension to flatten the lists:
out = [y for x in df['tokens'] for y in x]
Or use itertools.chain:
from itertools import chain
out = list(chain.from_iterable(df['tokens']))
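As a small worked illustration of the two approaches above, here is a sketch on plain nested lists. The name `reviews` and its contents are made up for the example; it stands in for what `df['tokens'].tolist()` would return, since iterating the column yields the same per-row lists.

```python
from itertools import chain

# Hypothetical sample data: one inner list of tokens per review.
reviews = [
    ["the", "pizza", "was", "okay"],
    ["i", "love", "this", "place"],
]

# Nested comprehension: the outer loop walks the reviews,
# the inner loop walks the words of each review.
flat_comprehension = [word for tokens in reviews for word in tokens]

# chain.from_iterable lazily yields the items of each inner list in order.
flat_chain = list(chain.from_iterable(reviews))

print(flat_comprehension)
print(flat_chain == flat_comprehension)
```

Both variants keep duplicates and preserve the original word order, which matches the expected output in the question.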
Performance:
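import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = list('abcdefghijklmno')
# Build a test frame: group random letters by B so each row of 'tokens' holds a list
df = (pd.DataFrame({'A': np.random.choice(L, N),
                    'B': np.random.randint(1000, size=N)})
        .groupby('B')['A'].apply(list).to_frame('tokens'))
print(df)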
In [269]: %timeit df['tokens'].sum()
15.1 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [270]: %timeit out = [y for x in df['tokens'] for y in x]
360 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [271]: %timeit out = list(chain.from_iterable(df['tokens']))
215 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer 1 (score: 1)
You can simply use:
df['tokens'].sum()
because adding lists concatenates them, so summing the column joins all of the lists into one.
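To see why summing works, and why it tends to be slower than the flattening approaches, here is a sketch on plain lists. It assumes that summing an object column amounts to applying `+` repeatedly, which is an approximation of what pandas does internally; `rows` is a made-up stand-in for the column.

```python
from functools import reduce
from operator import add

# Hypothetical stand-in for the rows of df['tokens'].
rows = [["the", "pizza"], ["was", "okay"], ["i", "love"]]

# Summing lists is repeated concatenation: (rows[0] + rows[1]) + rows[2].
# Every + builds a brand-new list, so the accumulated words are copied again each step.
summed = reduce(add, rows)

# Extending one list in place avoids rebuilding the accumulated result.
extended = []
for row in rows:
    extended.extend(row)

print(summed == extended)  # True
```

The result is the same either way, but the repeated copying is the likely reason sum() falls behind as the number of rows grows.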