How do I get the most frequent words from a corpus?

Asked: 2017-03-03 08:35:30

Tags: python python-2.7 nltk counter corpus

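I am working with a corpus and want to get the most and least used words and word classes from it. I have the beginning of some code, but I get errors that I don't know how to handle. I want to get the most frequent words from the Brown corpus, and then the most and least frequent word classes. I have this code: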

import re
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from collections import defaultdict, Counter
from nltk.corpus import brown

brown = nltk.corpus.brown
stoplist = stopwords.words('english')

from collections import defaultdict

def toptenwords(brown):
    words = brown.words()
    no_capitals = ([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
    return sorting

print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print Counter(words_2).most_common(10)

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print Counter(words_2).most_common(10)


# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] +=1

# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print

# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]

print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])

# which word has the greatest number of distinct tags
word_tags_2 = nltk.defaultdict(lambda: set())
for word, token in tagged_words:
    word_tags[word].add(token)
    ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]),
    key=itemgetter(1), reverse=True)[:50]
  print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]

When I run the code above, I get the following error:

File "Oblig2a.py", line 64
    key=itemgetter(1), reverse=True)[:50]
                               ^
SyntaxError: invalid syntax

From this code I want to get:

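  1. The most frequent words
  2. The most frequent word class
  3. The least frequent word class
  4. How many words belong to more than one word class
  5. Which word has the most tags, and how many distinct tags it has
  6. The last thing I need help with is writing a function that takes a specific word and reports how many times it occurs with each tag. I tried to do that above, but I can't get it to work...

It is numbers 3, 4, 5 and 6 that I need help with... Any help would be most welcome.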

1 Answer:

Answer 0 (score: 0)

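There are several problems with the code:

  1. The error the interpreter reports: you should pass a language name to the stopwords function: stoplist = stopwords.words('english')
  2. Sort the dict correctly by using the dictionary's get method as the sort key: [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)]
  3. Use a translate table that works on Unicode data; see "string.translate() with unicode data in python".
  4. Brown tagged words are tuples of the form (word, part-of-speech).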
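
As a rough illustration of points 2 and 3, here is a minimal sketch that strips punctuation with a dict-based translate table and then sorts the counts using the dict's get method. The helper names strip_punctuation and sorted_counts and the sample tokens are made up for this example; in practice the tokens would come from brown.words().

```python
import string
from collections import defaultdict

# Map every ASCII punctuation code point to None so translate() deletes it.
# A dict-based table works on text strings (unicode in Python 2, str in Python 3).
translate_table = dict((ord(char), None) for char in string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation characters and drop tokens that end up empty."""
    stripped = (token.translate(translate_table) for token in tokens)
    return [token for token in stripped if token]

def sorted_counts(tokens):
    """Count tokens and return (token, count) pairs, most frequent first."""
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    return [(k, counts[k]) for k in sorted(counts, key=counts.get, reverse=True)]

# Made-up sample tokens standing in for brown.words()
sample = [u"Jury", u",", u"said", u"jury", u"don't", u"said", u"jury", u"."]
cleaned = strip_punctuation(token.lower() for token in sample)
print(sorted_counts(cleaned))
```

Tokens that consist only of punctuation become empty strings after translate(), so the sketch drops them before counting.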

Full code:

    import string
    from collections import Counter, defaultdict
    
    import nltk
    from nltk.corpus import stopwords
    
    # Requires the corpora: nltk.download('brown') and nltk.download('stopwords')
    brown = nltk.corpus.brown
    stoplist = set(stopwords.words('english'))  # a set makes membership tests fast
    
    def toptenwords(brown):
        words = brown.words()
        # Keep a list here: a set would discard duplicates and break the counts
        no_capitals = [word.lower() for word in words]
        filtered = [word for word in no_capitals if word not in stoplist]
        # A dict-based table is what translate() expects for unicode strings
        translate_table = dict((ord(char), None) for char in string.punctuation)
        no_punct = [s.translate(translate_table) for s in filtered]
        wordcounter = defaultdict(int)
        for word in no_punct:
            if word:  # skip tokens that were pure punctuation
                wordcounter[word] += 1
        sorting = [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
        return sorting[:10]
    
    
    print(toptenwords(brown))
    
    # Tagged words are (word, part-of-speech) tuples
    words_2 = [word[0] for word in brown.tagged_words(categories="news")]
    # the most frequent words
    print(Counter(words_2).most_common(10))
    
    words_2 = [word[1] for word in brown.tagged_words(categories="news")]
    # the most frequent word class
    print(Counter(words_2).most_common(10))
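
The remaining items from the question (least frequent word class, words with more than one class, the word with the most distinct tags, and per-word tag counts) could be approached along the following lines. This is only a sketch under stated assumptions: the tagged list is made-up sample data standing in for brown.tagged_words(), and tag_counts_for is a hypothetical helper name. It also shows one way to write the sorted(...) call from the question with balanced parentheses and the itemgetter import it needs.

```python
from collections import Counter, defaultdict
from operator import itemgetter

# Made-up (word, tag) pairs standing in for brown.tagged_words()
tagged = [
    ("the", "AT"), ("red", "JJ"), ("run", "VB"), ("the", "AT"),
    ("red", "NN"), ("run", "NN"), ("the", "AT"), ("red", "NP"),
    ("run", "VB"), ("the", "AT"), ("red", "JJ"),
]

# Frequency of each word class
tag_counts = Counter(tag for _, tag in tagged)

# For each word, a Counter of the tags it appears with
word_tags = defaultdict(Counter)
for word, tag in tagged:
    word_tags[word][tag] += 1

most_common_tag = tag_counts.most_common(1)[0]
least_common_tag = min(tag_counts.items(), key=itemgetter(1))

# Words that occur with more than one word class
ambiguous = [w for w, tags in word_tags.items() if len(tags) > 1]

# (word, number of distinct tags) pairs, most ambiguous first
by_distinct_tags = sorted(
    ((w, len(tags)) for w, tags in word_tags.items()),
    key=itemgetter(1), reverse=True)
top_word, top_count = by_distinct_tags[0]

def tag_counts_for(word):
    """Return {tag: count} for one word, or {} if the word was never seen."""
    return dict(word_tags.get(word, {}))

print(most_common_tag, least_common_tag)
print(len(ambiguous), top_word, top_count)
print(tag_counts_for("run"))
```

To try this on the real corpus, replace the sample list with list(brown.tagged_words()). When several tags or words tie, min() and by_distinct_tags[0] each return just one of them.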