Question

I have a Python module that takes a text file as an argument and counts the frequency of word lengths in that file.

#!/usr/bin/python3

import sys  
import string

def get_len(word):  
    punc = set(string.punctuation)
    clean_word = "".join(character for character in word if character not in punc)
    return len(clean_word)

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
words = fname.read().split()


word_length_count = {}

for word in words:
    word_length = get_len(word)
    if word_length in word_length_count.keys():
        word_length_count[word_length] += 1
    else:
        word_length_count[word_length] = 1

print('Length', 'Count')

for key in word_length_count.keys():
    if key > 0:    
        print("      %d    %d" % (key, word_length_count[key]))

fname.close()

I would like to turn the output into a text-based histogram, but I'm not sure where to start. Example below:

Length Count  
      1    16   
      2    267  
      3    267  
      4    169  
      5    140  
      6    112  
      7    99  
      8    68  
      9    61  
      10    56  
      11    35  
      12    13  
      13    9  
      14    7  
      15    2  

  400 -|                                             
       |                                             
       |                                             
       |                                             
       |                                             
  300 -|                                             
       |                                             
       |   ******                                    
       |   ******                                    
       |   ******                                    
  200 -|   ******                                    
       |   ******                                    
       |   *********                                 
       |   ************                              
       |   ************                              
  100 -|   ***************                           
       |   ******************                        
       |   ************************                  
       |   ***************************               
       |   ******************************            
    0 -+-+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+-
       | 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

Answer 1

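You can write a function that returns one horizontal line of the histogram for a given height. For each column it outputs * if the column is at least that tall, and spaces otherwise: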

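def get_histogram_line(height, max_length):
    # Build one row: three characters per word length, from 0 to max_length.
    s = ""
    for i in range(0, max_length + 1):
        # .get() avoids a KeyError for lengths that never occurred.
        if word_length_count.get(i, 0) >= height:
            s += "***"
        else:
            s += "   "
    return s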

Then iterate over a range of height values, starting from the maximum and decreasing:

for h in range(400, 0, -20):
    print(get_histogram_line(h, 15))

Output:

   ******                                    
   ******                                    
   ******                                    
   ******                                    
   ******                                    
   *********                                 
   ************                              
   ************                              
   ***************                           
   ******************                        
   ************************                  
   ***************************               
   ******************************

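Then add extra formatting for the labels and so on. You can also compute the maximum height and the step size from the data instead of hard-coding them.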
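
As a rough sketch of that last suggestion, the function below derives the step size from the tallest column and adds the axis labels. The name render_histogram and its rows and col_width parameters are made up for this illustration, and it assumes a non-empty dictionary with at least one positive word length as a key; length 0 is skipped, as in the question's own loop.

```python
def render_histogram(counts, rows=10, col_width=3):
    """Return the text lines of a vertical histogram for a {length: count} dict.

    Hypothetical helper: assumes counts is non-empty and has a positive key.
    """
    lengths = range(1, max(counts) + 1)   # columns 1..longest word length
    peak = max(counts.values())
    # Ceiling division, so that rows * step always reaches the peak.
    step = max(1, -(-peak // rows))
    lines = []
    for height in range(rows * step, 0, -step):
        bars = "".join(
            "*" * col_width if counts.get(n, 0) >= height else " " * col_width
            for n in lengths
        )
        lines.append(("%5d -|%s" % (height, bars)).rstrip())
    # Horizontal axis, then one right-aligned label under each column.
    lines.append("%5d -+%s" % (0, "-" * (col_width * len(lengths))))
    lines.append(" " * 7 + "".join("%*d" % (col_width, n) for n in lengths))
    return lines


if __name__ == "__main__":
    # Counts copied from the table in the question.
    sample = {1: 16, 2: 267, 3: 267, 4: 169, 5: 140, 6: 112, 7: 99, 8: 68,
              9: 61, 10: 56, 11: 35, 12: 13, 13: 9, 14: 7, 15: 2}
    print("\n".join(render_histogram(sample)))
```

Rounding the step up keeps the tallest column inside the chart. The trade-off is that a column shorter than one step produces no stars at all, so raise rows if small counts matter.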
