Answer 0 (score: 0)
First, some fancy code to generate the DataFrame.
from io import StringIO
import pandas as pd
sio = StringIO("""I am just going to type up something because you inserted an image instead ctr+c and ctr+v the code to Stackoverflow.
Actually, it's unclear what you want to do with the ngram counts.
Perhaps, it might be better to use the `nltk.everygrams()` if you want a global count.
And if you're going to build some sort of ngram language model, then it might not be efficient to do it as you have done too.""")
with sio as fin:
    texts = [line for line in fin]
df = pd.DataFrame({'text': texts})
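Then you can easily use apply on the text column (strictly Series.apply, since df['text'] is a Series) to extract the n-grams, for example: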
from collections import Counter
from functools import partial
from nltk import ngrams, word_tokenize
for i in range(1, 4):
    _ngrams = partial(ngrams, n=i)
    df['{}-grams'.format(i)] = df['text'].apply(lambda x: Counter(_ngrams(word_tokenize(x))))
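
To make it clearer what ends up in each cell, here is a minimal sketch of the same counting step for a single sentence, using only the standard library. Note that simple_ngrams is a hypothetical stand-in for nltk.ngrams, and str.split is assumed in place of word_tokenize, so token boundaries will differ from NLTK's on real text.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.ngrams: yield consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


text = "to be or not to be"
tokens = text.split()  # assumption: whitespace split instead of word_tokenize

# One Counter per n-gram order, mirroring the 1-grams, 2-grams and 3-grams columns.
counts = {n: Counter(simple_ngrams(tokens, n)) for n in range(1, 4)}

print(counts[2][("to", "be")])  # 2
```

Each Counter is keyed by a tuple of tokens, so even a unigram is looked up as ("to",) rather than "to".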