Is there a way to analyze a text file to check for these conditions?

Asked: 2020-09-03 05:36:40

Tags: python python-3.x python-2.7 file word

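I need to create a program that analyzes a passage of text in a file and then counts:

  • How many words there are
  • The average length of a word
  • How many times each word occurs
  • How many words start with each letter

So far I have managed the first two bullet points (shown below):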

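# Prompt for the file name and read the file line by line
file_name = input('Please enter the full name of the file: ')
with open(file_name, 'r') as f:
    # split() with no argument splits on any whitespace and skips empty strings,
    # so blank lines do not add zero-length "words"
    w = [len(word) for line in f for word in line.split()]

total_w = len(w)
avg_w = sum(w) / total_w  # raises ZeroDivisionError if the file has no words

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)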

1 answer:

Answer 0: (score: 1)

collections.Counter makes this relatively simple. I used re.findall(r'[\w]+', data) to find the words (a word being any run of letters, underscores, and digits). Adjust as needed.
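To get a feel for what that pattern treats as a word, here is a minimal sketch that compares it with plain whitespace splitting. The sample string is made up purely for illustration; note that the regex splits "It's" at the apostrophe, which may or may not be what you want.

```python
import re

sample = "Hello, world! It's file_1 and 42."  # hypothetical sample text

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
regex_words = re.findall(r'\w+', sample)

# split() only breaks on whitespace, so punctuation stays attached to the words
split_words = sample.split()

print(regex_words)
print(split_words)
```
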

import re
from collections import Counter

fn = input('Please enter the full name of the file: ')
with open(fn, 'r') as f:
    words = Counter(re.findall(r'[\w]+', f.read()))
    # use words = Counter(f.read().split()) if everything split by spaces
    # adjust regular expression depending on whether you want or don't want
    # stuff like numbers to be counted as "words"

print('Total number of words:', sum(words.values()))
# this is weighted by word occurrence, not sure whether this is correct
print('Average length of words:', 
      sum(len(w) * o for w, o in words.items()) / sum(words.values()))
print('Word occurrence:', words)
# this only shows letters that actually occur. If you need all letters of 
# the alphabet, you have to add the rest
print('Start letter occurrence:', Counter(w[0] for w in words.elements()))
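
If the start-letter report should list every letter of the alphabet, including those with a count of zero, one possible approach is to look each letter up in the counter. This is only a sketch under a few assumptions: the first character is lowercased so that "The" and "the" count together, only ASCII letters a-z are reported, and the helper name start_letter_counts and the sample word list are invented for illustration.

```python
from collections import Counter
import string

def start_letter_counts(words):
    """Map each ASCII letter a-z to the number of words starting with it."""
    starts = Counter(w[0].lower() for w in words if w)
    # A Counter returns 0 for missing keys, so absent letters are filled in;
    # words starting with digits or other characters are simply not reported
    return {letter: starts[letter] for letter in string.ascii_lowercase}

sample_words = ['The', 'tree', 'and', 'the', 'apple', '42']  # hypothetical input
counts = start_letter_counts(sample_words)
print(counts['t'], counts['a'], counts['z'])
```

With the Counter built above, calling start_letter_counts(words.elements()) would count every occurrence of each word, while start_letter_counts(words) would count each distinct word once.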