Question

My initial DataFrame:

                           text  n_gram_len
0         This is the best text           2
1  This is some other best text           1
2   Well this is something else           3

Desired output:

                                              n_gram  
0             [This is, is the, the best, best text]  
1                [This, is, some, other, best, text]  
2  [Well this is, this is something, is something...

My code:

from nltk import ngrams
# One list of space-joined n-grams per row, sized by that row's n_gram_len
df['n_gram'] = df[['text', 'n_gram_len']].apply(
    lambda row: [' '.join(gram) for gram in ngrams(row.text.split(), row.n_gram_len)], axis=1)

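The problem is that my DataFrame contains 100,000 strings, each with its own n-gram length, and this code takes 50 seconds to run. Is there a better way to do this, or a way to make the current code more efficient?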
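One variation that may be worth timing is to skip `apply(axis=1)`, which builds a Series object for every row, and loop over the two columns with `zip` instead. The sketch below is only an illustration and has not been benchmarked on the 100,000-row data. The helper name `word_ngrams` is made up for this example, and it uses plain list slicing in place of `nltk.ngrams`.

```python
import pandas as pd

def word_ngrams(text, n):
    """Return the space-joined word n-grams of `text` as a list (hypothetical helper)."""
    words = text.split()
    # Slide a window of n words across the sentence; yields [] if n > len(words).
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Same toy data as in the question.
df = pd.DataFrame({
    'text': ['This is the best text',
             'This is some other best text',
             'Well this is something else'],
    'n_gram_len': [2, 1, 3],
})

# Iterate over the two columns directly instead of calling apply(axis=1).
df['n_gram'] = [word_ngrams(t, n) for t, n in zip(df['text'], df['n_gram_len'])]
```

Whether this is actually faster on the real data would need to be measured, for example with `%timeit`.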

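Once the n-grams are created, I compute more values from them. I am currently using a list as the data structure to store the n-grams. Is there another way to store them that would reduce the time needed for further processing?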
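One storage layout that might suit later column-wise processing is a "long" format with one n-gram per row, which `DataFrame.explode` (pandas 0.25 or newer) produces from a list column. This is a sketch on toy data, assuming the later steps are things like counting or grouping. It may not help if the lists are needed as lists.

```python
import pandas as pd

# Toy frame that already holds one list of n-grams per row.
df = pd.DataFrame({
    'n_gram': [['This is', 'is the', 'the best', 'best text'],
               ['This', 'is', 'some', 'other', 'best', 'text']],
})

# explode() gives each n-gram its own row; the index keeps the original
# row label, so results can be mapped back to the source text.
long_df = df.explode('n_gram')

# Column-wise operations now work on plain strings instead of lists.
counts = long_df['n_gram'].value_counts()
```

The trade-off is that the exploded frame has many more rows, so memory use goes up with the total number of n-grams.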

Generating n-grams for a large dataset based on the value in another column

0 answers: