I have the following sample DataFrame:
No   category  problem_definition_stopwords
175  2521      ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211  1438      ['galley', 'work', 'table', 'stuck']
912  2698      ['cloth', 'stuck']
572  2521      ['stuck', 'coffee']
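The "problem_definition_stopwords" field has already been tokenized, with stop words removed.
I want to create n-grams from the "problem_definition_stopwords" field. Specifically, I want to extract n-grams from my data and find the ones with the highest pointwise mutual information (PMI).
Essentially, I want to find words that co-occur much more often than I would expect by chance.
I tried the following code: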
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
# errored out here
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(df['problem_definition_stopwords']))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
The error I get is on the third chunk of code: TypeError: join() argument must be str or bytes, not 'list'
Edit: a more portable format of the DataFrame:
>>> df.columns
Index(['No', 'category', 'problem_definition_stopwords'], dtype='object')
>>> df.to_dict()
{'No': {0: 175, 1: 211, 2: 912, 3: 572}, 'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521}, 'problem_definition_stopwords': {0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'], 1: ['galley', 'work', 'table', 'stuck'], 2: ['cloth', 'stuck'], 3: ['stuck', 'coffee']}}
Answer 0 (score: 1)
It doesn't look like you're using the from_words call the right way, judging by help(nltk.corpus.genesis.words):
Help on method words in module nltk.corpus.reader.plaintext:
words(fileids=None) method of nltk.corpus.reader.plaintext.PlaintextCorpusReader instance
:return: the given file(s) as a list of words
and punctuation symbols.
:rtype: list(str)
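The help text shows that words() expects file identifiers, not token lists. The TypeError in the question is consistent with a list ending up where a path string is expected. Assuming the corpus reader joins each file ID onto its root directory with something like os.path.join (an assumption about NLTK internals, not something the help text confirms), the same kind of failure can be reproduced with the standard library alone. The names corpus_root and tokens below are hypothetical.

```python
import os

# Hypothetical stand-ins: a corpus root directory and one row of tokens.
corpus_root = "corpora/genesis"
tokens = ['coffee', 'maker']

error_message = None
try:
    # A path component must be str or bytes; a list of tokens is rejected.
    os.path.join(corpus_root, tokens)
except TypeError as exc:
    error_message = str(exc)

# The exact wording varies between Python versions, but it names 'list'.
print(error_message)
```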
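Is this what you're looking for? Since you've already represented your documents as lists of strings, which plays nicely with NLTK in my experience, I think you can use the from_documents method: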
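from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
# each row of the column is already a list of tokens, i.e. one document
finder = BigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)
# keep only bigrams that appear at least N times (the question used 3)
# Note, I limited this to 1 since the corpus you provided
# is very small and it'll be tough to find repeat ngrams
finder.apply_freq_filter(1)
# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)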
[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
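To make concrete what the PMI score measures, here is a minimal pure-Python sketch that scores adjacent word pairs on the question's sample data. It is not NLTK's implementation: the helper name bigram_pmi and the way probabilities are estimated (pair counts over total pairs, word counts over total words) are assumptions made for illustration, so the exact scores and tie ordering may differ from what finder.nbest returns.

```python
import math
from collections import Counter

# Token lists copied from the question's sample data.
documents = [
    ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
    ['galley', 'work', 'table', 'stuck'],
    ['cloth', 'stuck'],
    ['stuck', 'coffee'],
]


def bigram_pmi(docs, min_freq=1):
    """Score adjacent pairs by PMI = log2(p(x, y) / (p(x) * p(y))).

    Hypothetical helper for illustration only, not part of NLTK.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        word_counts.update(tokens)
        # Pair each token with its successor; pairs never span two documents.
        pair_counts.update(zip(tokens, tokens[1:]))

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for (x, y), count in pair_counts.items():
        if count < min_freq:
            continue  # same idea as a frequency filter
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


scores = bigram_pmi(documents)
# Highest PMI first; pairs of words that are rare on their own but
# appear together rank above pairs built from frequent words.
ranked = sorted(scores, key=scores.get, reverse=True)
for pair in ranked:
    print(pair, round(scores[pair], 3))
```

With so few documents, most pairs occur exactly once, so many scores tie. Pairs built from words that appear several times elsewhere, such as 'stuck' and '420', score lower because their individual probabilities are higher.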