Question

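The program should read a text, find the ten most common words, sort them by frequency, and print the list in that order. (This happens when the "--topcount" flag is passed.)

I am trying to modify the program slightly so that, after finding the top 10 most common words by frequency, it sorts that list alphabetically and prints it, so the output is in alphabetical order rather than numeric order.

Current code: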

import sys

def word_dictionary(filename):
  word_count = {} #create dict
  input = open(filename, 'r')
  for line in input:
    words = line.split()#split lines on whitespace
    for word in words:
      word = word.lower() #forces all found words into lowercase
      if not word in word_count:
        word_count[word] = 1
      else:
        word_count[word] = word_count[word] + 1
  input.close()
  return word_count


def print_words(filename):
  word_count = word_dictionary(filename)
  words = sorted(word_count.keys())
  for word in words:
    print word, word_count[word]


def get_count(word_count_tuple):
  return word_count_tuple[1]

def print_top(filename):
  word_count = word_dictionary(filename)
  items = sorted(word_count.items(), key=get_count, reverse=True)
  for item in items[:20]:
    print item[0], item[1]

def main():
  if len(sys.argv) != 3:
    print 'usage: ./wordcount.py {--count | --topcount} file'
    sys.exit(1)

  option = sys.argv[1]
  filename = sys.argv[2]
  if option == '--count':
    print_words(filename)
  elif option == '--topcount':
    print_top(filename)
  else:
    print 'unknown option: ' + option
    sys.exit(1)

if __name__ == '__main__':
  main()

I tried doing this:

def get_alph(word_count_tuple):
  return word_count_tuple[0]

to replace the "def get_count(word_count_tuple)" function, and changed the "print_top" function so that

  items = sorted(word_count.items(), key = get_alph)

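produces an alphabetical list. It did not work as intended: instead, it printed the first 10 words of the list of all words in the text, sorted alphabetically.

Any suggestions for making this work as intended?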

Answer 1

Sort the slice of the sorted words:

def print_top(filename):
    word_count = word_dictionary(filename)
    items = sorted(word_count.items(), key=get_count, reverse=True)
    for item in sorted(items[:20]):
        print item[0], item[1]

This first produces items, a list sorted by count, and then sorts the first 20 entries of that list again, this time alphabetically.
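As a minimal sketch of those two steps on in-memory data: the `counts` dictionary and the `top_alphabetical` helper below are made up for illustration and are not part of the original program. The sketch uses `print()` with a single argument so that it also runs under Python 3, unlike the Python 2 code in the question.

```python
def top_alphabetical(counts, n):
    """Return the n most frequent (word, count) pairs, sorted alphabetically."""
    # Step 1: order every pair by count, highest first.
    by_count = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # Step 2: keep only the first n pairs and re-sort that slice by word.
    return sorted(by_count[:n])


# Hypothetical word counts, not taken from any real text.
counts = {'the': 9, 'cat': 4, 'sat': 2, 'on': 5, 'a': 7, 'mat': 1}
for word, count in top_alphabetical(counts, 3):
    print('%s %d' % (word, count))
# Prints:
# a 7
# on 5
# the 9
```

The slice has to happen between the two sorts. Sorting alphabetically first and slicing afterwards is what produced the unwanted output described in the question.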

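Because your items are (word, count) tuples, you don't need a sort key here. Tuples are sorted in lexicographic order too: two tuples are compared on their first value first, and only when those are equal on the second value, and so on.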
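A small illustration of that tuple ordering, using invented pairs. Real word-count items never repeat a word, so the duplicate 'apple' entries here exist only to show the tie-breaking rule.

```python
# Invented (word, count) pairs; the repeated word is deliberate.
pairs = [('pear', 3), ('apple', 3), ('apple', 1), ('fig', 10)]

# Tuples compare element by element: the word decides the order first,
# and the count is consulted only when two words are identical.
ordered = sorted(pairs)
print(ordered)
# Prints: [('apple', 1), ('apple', 3), ('fig', 10), ('pear', 3)]
```

Since every word in a word-count dictionary is unique, the count never actually takes part in the comparison, and a plain `sorted()` call orders the pairs by word.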

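Note that sorting the entire list of word_count items is overkill if you only need the top K items. Use the heapq.nlargest() function here instead; it is an O(N log K) algorithm rather than O(N log N), and for large N (a large number of words) this can make a significant difference: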

import heapq

def print_top(filename):
    word_count = word_dictionary(filename)
    # nlargest() takes the number of items to keep as its first argument
    items = heapq.nlargest(20, word_count.items(), key=get_count)
    for item in sorted(items):
        print item[0], item[1]
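As a rough check that the heap-based version picks the same words as the full sort, here is a sketch on hypothetical counts. The `count_of` helper and the `counts` data are assumptions for illustration, and the counts contain no ties so the two results can be compared directly.

```python
import heapq


def count_of(pair):
    # Each pair is (word, count); rank by the count.
    return pair[1]


# Hypothetical counts with no ties, so both approaches must agree exactly.
counts = {'the': 9, 'cat': 4, 'sat': 2, 'on': 5, 'a': 7, 'mat': 1}

# Full sort, then slice: O(N log N).
via_sort = sorted(counts.items(), key=count_of, reverse=True)[:3]
# Heap-based selection of the 3 largest: O(N log K).
via_heap = heapq.nlargest(3, counts.items(), key=count_of)

print(via_heap)          # [('the', 9), ('a', 7), ('on', 5)]
print(sorted(via_heap))  # [('a', 7), ('on', 5), ('the', 9)]
```

Both approaches return the top items ordered from largest to smallest count, so the final alphabetical `sorted()` call is still needed in either case.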

Python: sorting a list alphabetically after sorting it numerically
