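I have this function, which I tried to adapt a little for my purposes, but instead of bigrams I get unigrams. What do I need to add or change? I am new to Python and NLTK.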
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
import re
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [x for x in tokens if x not in stopwords.words('english') and len(x) > 3]
    return result
filename = raw_input('Enter File Name :')
word_list = re.split('\s+', file(filename).read().lower())
f=open ('test2.csv', 'w')
for line in word_list:
    features = get_bigrams(line)
    print features
    f.write(str(line))
    f.write("\n")
Example output for "It has been a long time":
It
has
been
a
long
time
However, I am looking for something like this:
It has
has been
been a
a long
long time
Answer 0 (score: 1)
NLTK seems like overkill here. Why not do it this way:
def pairs(seq):
    return zip(seq, seq[1:])
s = "It has been a long time"
words = s.split()
for bigram in pairs(words):
    print bigram
Result:
('It', 'has')
('has', 'been')
('been', 'a')
('a', 'long')
('long', 'time')
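The snippet above is Python 2 (`print bigram`). On Python 3, `print` is a function and `zip` returns a lazy iterator, so the same idea might look like the sketch below. The `bigram_strings` helper is a name of my own, added only to produce the "It has" string form the question asks for.

```python
def pairs(seq):
    # Pair each item with its successor; zip stops at the shorter input,
    # so the last item does not start a pair of its own.
    return zip(seq, seq[1:])


def bigram_strings(text):
    # Hypothetical helper (not part of the answer above): split on
    # whitespace and join each pair with a single space.
    words = text.split()
    return [" ".join(pair) for pair in pairs(words)]


for bigram in bigram_strings("It has been a long time"):
    print(bigram)
```

A text with fewer than two words yields an empty list, which is worth keeping in mind for the question's symptom.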
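Answer 1 (score: 1)
I think your problem is in how you read the file and process its lines.
The following line gives you a list of words (as the name suggests):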
word_list = re.split('\s+', file(filename).read().lower())
But later you treat each word as if it were a line:
for line in word_list:
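So get_bigrams only ever receives one word at a time, and a single word contains no bigrams, which is why you get unigrams.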
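A small sketch of the effect, in Python 3 syntax. Plain `str.split` stands in for the NLTK tokenizer here; that substitution is an assumption made only to keep the example free of dependencies.

```python
import re

text = "it has been a long time"

# Same splitting as in the question: the whole text becomes a list of words.
word_list = re.split(r"\s+", text)

for line in word_list:
    tokens = line.split()                      # stand-in for tokenizer.tokenize(line)
    bigrams = list(zip(tokens, tokens[1:]))    # adjacent pairs within this "line"
    print(line, bigrams)                       # each "line" is one word, so no pairs
```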
If I understand you correctly, you probably want to change the file reading as follows:
filename = raw_input('Enter File Name :')
lines = file(filename).readlines()
f = open('test2.csv', 'w')
for line in lines:
    features = get_bigrams(line)
    # do more things
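For reference, a minimal end-to-end sketch of the same fix in Python 3. It assumes whitespace tokenization instead of NLTK and one CSV row per input line; `write_bigrams` and the file name `input.txt` in the usage line are hypothetical names chosen for illustration.

```python
import csv


def get_bigrams(line):
    # Whitespace tokenization stands in for the NLTK tokenizer here.
    tokens = line.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def write_bigrams(in_path, out_path):
    # Hypothetical helper: iterate over real lines of the file and write
    # one CSV row per line, with one bigram per cell.
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow(get_bigrams(line.lower()))
```

It could be called as `write_bigrams("input.txt", "test2.csv")`. Iterating over the file object yields whole lines, so each call to `get_bigrams` sees enough words to form pairs.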
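Answer 2 (score: 0)
Your function get_bigrams seems to work for me, so I think the problem is your file or the way you read it.
By the way, I would like to suggest shorter code for get_bigrams: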
import nltk
def get_bigrams(sentence):
    tokens = nltk.word_tokenize(sentence)
    return zip(tokens, tokens[1:])
Usage:
>>> [' '.join(b) for b in get_bigrams("It has been a long time")]
['It has', 'has been', 'been a', 'a long', 'long time']