Counting word occurrences in an array of word arrays, without knowing the words, in Python

Asked: 2013-01-19 15:15:24

Tags: python arrays count nltk

I'm new to Python programming, and I hope one of you can help me.

I have to print the top ten bigrams of a corpus in this form:

((token),(POS_tag),(token),(POS_tag))

where the number of occurrences of each token must be greater than 2.

So far I have made a list of POS-tagged tokens and paired them up with bigrams().

How can I check whether the number of occurrences of each word (with the tag corresponding to each pair) is > 2?

2 Answers:

Answer 0 (Score: 0)

Your question is vague for various reasons. For one thing, the title could be worded better, and you don't explain very well what you want to do. By "the top ten bigrams", do you actually mean the first ten bigrams in the text, or the ten most frequent bigrams? I assume it's the latter, but if not, just remove the sorting and limit the text to the first 11 words.

from nltk.util import bigrams
from nltk import tokenize, pos_tag
from collections import defaultdict

counts = defaultdict(int)
counts_pos = defaultdict(int)

with open('twocities.txt') as f:
    txt = f.read().lower()
    txt = tokenize.word_tokenize(txt)

    # Generate the lexical bigrams
    bg = bigrams(txt)

    # Do part-of-speech tagging and generate 
    # lexical+pos bigrams
    pos = pos_tag(txt)
    bg_pos = bigrams(pos)

    # Count the number of occurrences of each unique bigram
    for bigram in bg:
        counts[bigram] += 1

    for bigram in bg_pos:
        counts_pos[bigram] += 1

# Make a list of bigrams sorted on number of occurrences
sortedbigrams = sorted(counts, key = lambda x: counts[x], reverse=True)
sortedbigrams_pos = sorted(counts_pos, key = lambda x: counts_pos[x],
                           reverse=True)

# Remove bigrams that occur less than the given threshold
print 'Number of bigrams before thresholding: %i, %i' % \
       (len(sortedbigrams), len(sortedbigrams_pos))

min_occurence = 2

sortedbigrams = [x for x in sortedbigrams if counts[x] > min_occurence]
sortedbigrams_pos = [x for x in sortedbigrams_pos if
            counts_pos[x] > min_occurence]
print 'Number of bigrams after thresholding: %i, %i\n' % \
       (len(sortedbigrams), len(sortedbigrams_pos))

# print results
print 'Top 10 lexical bigrams:'
for i in range(10):
    print sortedbigrams[i], counts[sortedbigrams[i]]

print '\nTop 10 lexical+pos bigrams:'
for i in range(10):
    print sortedbigrams_pos[i], counts_pos[sortedbigrams_pos[i]]

My nltk installation only works on Python 2.6; if I had it on 2.7, I would use Counter instead of defaultdict.
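For the Counter variant just mentioned, a minimal sketch (Python 3 syntax, with a made-up sample sentence standing in for the tokenized corpus) could look like this:

```python
from collections import Counter

tokens = "it was the best of times it was the worst of times".split()

# Pair each token with its successor to form bigrams
bg = list(zip(tokens, tokens[1:]))

# Counter replaces the defaultdict(int) + manual-increment pattern
counts = Counter(bg)

# most_common(10) returns the top-ten bigrams already sorted by frequency
for bigram, n in counts.most_common(10):
    print(bigram, n)
```

`Counter.most_common()` also removes the need for the separate `sorted(...)` call, since it sorts by count internally.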

Using this script on the first page of A Tale of Two Cities, I get the following output:

Top 10 lexical bigrams:
(',', 'and') 17
('it', 'was') 12
('of', 'the') 11
('in', 'the') 11
('was', 'the') 11
(',', 'it') 9
('and', 'the') 6
('with', 'a') 6
('on', 'the') 5
(',', 'we') 4

Top 10 lexical+pos bigrams:
((',', ','), ('and', 'CC')) 17
(('it', 'PRP'), ('was', 'VBD')) 12
(('in', 'IN'), ('the', 'DT')) 11
(('was', 'VBD'), ('the', 'DT')) 11
(('of', 'IN'), ('the', 'DT')) 11
((',', ','), ('it', 'PRP')) 9
(('and', 'CC'), ('the', 'DT')) 6
(('with', 'IN'), ('a', 'DT')) 6
(('on', 'IN'), ('the', 'DT')) 5
(('and', 'CC'), ('a', 'DT')) 4

Answer 1 (Score: 0)

I assume you mean the top 10 bigrams, and I've excluded bigrams in which one of the tokens is punctuation.

import nltk, collections, string
import nltk.book

def bigrams_by_word_freq(tokens, min_freq=3):
    def unique(seq): # http://www.peterbe.com/plog/uniqifiers-benchmark
        seen = set()
        seen_add = seen.add
        return [x for x in seq if x not in seen and not seen_add(x)]

    punct = set(string.punctuation)
    bigrams = unique(nltk.bigrams(tokens))
    pos = dict(nltk.pos_tag(tokens))
    count = collections.Counter(tokens)

    bigrams = filter(lambda (a,b): not punct.intersection({a,b}) and count[a] >= min_freq and count[b] >= min_freq, bigrams)

    return tuple((a,pos[a],b,pos[b]) for a,b in bigrams)



text = """Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again."""

print bigrams_by_word_freq(nltk.wordpunct_tokenize(text), min_freq=2)

print bigrams_by_word_freq(nltk.book.text6)[:10]
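Note that tuple parameters in lambdas (`lambda (a,b): ...`) are Python 2-only syntax. A rough Python 3 rewrite of just the filtering step (leaving out the POS tagging, which needs NLTK's tagger; the sample tokens below are made up) might look like:

```python
import string
from collections import Counter

def filter_bigrams(tokens, min_freq=3):
    """Return unique bigrams, in order of first appearance, whose tokens
    each occur >= min_freq times and contain no punctuation."""
    count = Counter(tokens)
    punct = set(string.punctuation)
    seen = set()
    result = []
    for pair in zip(tokens, tokens[1:]):
        if pair in seen:
            continue
        seen.add(pair)
        a, b = pair
        # Keep bigrams with no punctuation whose tokens are frequent enough
        if not punct & {a, b} and count[a] >= min_freq and count[b] >= min_freq:
            result.append(pair)
    return result

print(filter_bigrams("the cat sat on the mat the cat".split(), min_freq=2))
```

Here only `('the', 'cat')` survives, since every other bigram contains a token that appears just once.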

Output:

(('Humpty', 'NNP', 'Dumpty', 'NNP'), ('the', 'DT', 'king', 'NN'))
(('SCENE', 'NNP', '1', 'CD'), ('clop', 'NN', 'clop', 'NN'), ('It', 'PRP', 'is', 'VBZ'), ('is', 'VBZ', 'I', 'PRP'), ('son', 'NN', 'of', 'IN'), ('from', 'IN', 'the', 'DT'), ('the', 'DT', 'castle', 'NN'), ('castle', 'NN', 'of', 'IN'), ('of', 'IN', 'Camelot', 'NNP'), ('King', 'NNP', 'of', 'IN'))