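I wrote some code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I would like to change it so that it counts bigram frequencies, i.e. pairs of words instead of single words, but my attempts so far have been unsuccessful at best.
I realize there is a lot to look at, but any help with this is greatly appreciated. Here is my code: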
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
for word in word_list2:
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result as top 10 most frequent words
freq_list4 =[]
freq_list4=freq_list3[:10]
words = []
for item in freq_list4:
    a = str(item[1])
    a = a.lower()
    words.append(a)
f = open(filename)
newlist = []
for line in f:
    line = punctuation.sub("", line)
    line = line.lower()
    newlist.append(line)
f2 = open('Lines.txt','w')
newlist2= []
for line in newlist:
    line = line.split()
    newlist2.append(line)
    f2.write(str(line))
    f2.write("\n")
print newlist2
# ARFF Creation
arff = open('output.arff','w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
    arff.write('@ATTRIBUTE ')
    arff.write(str(word))
    arff.write(' numeric\n')
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')
# Counting word frequencies for each verse
for line in newlist2:
    word_occurrences = str("")
    for word in words:
        matches = int(0)
        for item in line:
            if str(item) == str(word):
                matches = matches + int(1)
            else:
                continue
        word_occurrences = word_occurrences + str(matches) + ","
    word_occurrences = word_occurrences + "endofworld"
    arff.write(word_occurrences)
    arff.write("\n")
print words
Answer 0 (score: 5)
This should get you started:
def bigrams(words):
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
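Note that the first bigram is (None, w1), where w1 is the first word, so you get a special bigram that marks the start of the text. If you also want an end-of-text bigram, add yield (wprev, None) after the loop.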
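As a rough sketch of how that generator could feed a frequency count: the helper name `bigram_counts` and the sample sentence are illustrative assumptions, not part of the question. This version skips the start-of-text pair and tallies the rest in a plain dictionary:

```python
def bigrams(words):
    """Yield (previous, current) pairs; the first pair is (None, first_word)."""
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w


def bigram_counts(words):
    """Count bigrams, ignoring the start-of-text marker pair (hypothetical helper)."""
    counts = {}
    for pair in bigrams(words):
        if pair[0] is None:
            continue  # skip the special (None, w1) bigram
        counts[pair] = counts.get(pair, 0) + 1
    return counts


sample = 'the cat sat on the cat'.split()
counts = bigram_counts(sample)
print(sorted(counts.items()))
```

If the start-of-text marker is meaningful for your classifier, drop the `continue` check and the (None, w1) pair will be counted like any other.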
Answer 1 (score: 3)
Generalized to n-grams with optional padding. This also uses defaultdict(int) for the frequencies, so that it works in Python 2.6:
from collections import defaultdict
def ngrams(words, n=2, padding=False):
    "Compute n-grams with optional padding"
    pad = [] if not padding else [None]*(n-1)
    grams = pad + words + pad
    return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

# grab n-grams
words = ['the','cat','sat','on','the','dog','on','the','cat']
for size, padding in ((3, 0), (4, 0), (2, 1)):
    print '\n%d-grams padding=%d' % (size, padding)
    print list(ngrams(words, size, padding))

# show frequency
counts = defaultdict(int)
for ng in ngrams(words, 2, False):
    counts[ng] += 1

print '\nfrequencies of bigrams:'
for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
    print c, ng
Output:
3-grams padding=0
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'),
('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'),
('on', 'the', 'cat')]
4-grams padding=0
[('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'),
('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'),
('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]
2-grams padding=1
[(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'),
('the', 'cat'), ('cat', None)]
frequencies of bigrams:
2 ('the', 'cat')
2 ('on', 'the')
1 ('the', 'dog')
1 ('sat', 'on')
1 ('dog', 'on')
1 ('cat', 'sat')
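To tie this back to the ARFF output in the question, here is a minimal sketch under stated assumptions: `lines` stands in for the per-verse token lists (`newlist2` in the question), the `w1_w2` attribute naming scheme and the helper names `top_bigrams` and `arff_text` are made up for illustration, and the class label is hard-coded as in the original code. It is written to run on Python 3 as well, and it returns the ARFF text as a string rather than writing a file.

```python
from collections import defaultdict


def ngrams(words, n=2):
    """Return the n-grams of a list of words as tuples (no padding)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_bigrams(lines, k=10):
    """Return the k most frequent bigrams over all lines (hypothetical helper)."""
    counts = defaultdict(int)
    for line in lines:
        for bg in ngrams(line, 2):
            counts[bg] += 1
    # highest count first; ties broken alphabetically so the order is stable
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [bg for bg, _ in ranked[:k]]


def arff_text(lines, k=10, label='endofworld'):
    """Build ARFF text with one numeric attribute per top bigram (hypothetical helper)."""
    attrs = top_bigrams(lines, k)
    out = ['@RELATION bigramfrequency', '']
    for w1, w2 in attrs:
        out.append('@ATTRIBUTE %s_%s numeric' % (w1, w2))
    out.append('@ATTRIBUTE class {endofworld, notendofworld}')
    out.append('')
    out.append('@DATA')
    for line in lines:
        line_bigrams = ngrams(line, 2)
        # one count per attribute, in attribute order, followed by the label
        row = [str(line_bigrams.count(bg)) for bg in attrs]
        out.append(','.join(row + [label]))
    return '\n'.join(out) + '\n'


lines = [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['a', 'dog', 'ran']]
text = arff_text(lines, k=2)
print(text)
```

Attribute names built this way are only safe if the words contain no spaces, quotes, or other special characters; the question's punctuation stripping helps, but quoting the names may still be needed for real data.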
Answer 2 (score: 1)
I've rewritten the first part for you, because it was noisy. Things to note: collections.Counter is great! OK, the code:
import re
import nltk
import collections
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
# create list of lower case words
word_list = re.split('\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)
words = (punctuation.sub("", word).strip() for word in word_list)
words = (word for word in words if word not in nltk.corpus.stopwords.words('english'))
# create dictionary of word:frequency pairs
frequencies = collections.Counter(words)
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
print frequencies
# display result as top 10 most frequent words
print frequencies.most_common(10)
[word for word, frequency in frequencies.most_common(10)]
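The same Counter approach can be pointed at word pairs instead of single words. The following is a small sketch, not the answerer's code: the function name `bigram_frequencies` and the sample word list are assumptions, and the input must be a list (call `list(words)` first if `words` is a generator, as it is above), because the sketch slices it.

```python
import collections


def bigram_frequencies(words):
    """Count adjacent word pairs in a list of words (hypothetical helper)."""
    # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
    return collections.Counter(zip(words, words[1:]))


words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']
frequencies = bigram_frequencies(words)
print(frequencies.most_common(2))
```

From here, `frequencies.most_common(10)` gives the top ten bigrams, which could replace the top ten words when building the ARFF attributes.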
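Answer 3 (score: 1)
Life would be much easier if you started using NLTK's FreqDist for counting. NLTK also has a bigrams function. Examples of both are on the following page.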