Question

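I want to count how many times each word occurs in a text file, and I'm not sure what is going wrong. I'm also struggling to find a way to make the count ignore capitalization.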

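The script takes two command-line arguments: the name of the input file and a threshold (an integer).
The input file contains exactly one word per line, with no whitespace before or after the word. The script does not need to validate the contents of the input file.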

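The letter case of the words in the input file does not matter for counting. For example, the script should treat "the", "The" and "THE" as the same word.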

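After counting the words, the script prints a report (to a file, output.txt) that lists the words and their counts. A word is printed only if its count is greater than or equal to the threshold given on the command line.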

Here is my code:

file = open(r"E:\number.txt", "r", encoding="utf-8-sig")

from collections import Counter
word_counter = Counter(file.read().split())

for item in word_counter.items():
    print("{}\t{}".format(*item))

file.close()

But I want the output in the following format:

Answer 1

import re

frequency = {}
# assuming the words are stored in s1.txt
with open('s1.txt', 'r') as file1:
    # lowercase everything so "the", "The" and "THE" count as one word
    text1 = file1.read().lower()
match_pattern = re.findall(r'[a-z]{1,189819}', text1)
# The longest word in English has 189,819 letters and would take you three and a half hours
# to pronounce correctly. Seriously. It's the chemical name of Titin (or connectin), a giant protein
# that functions as a molecular spring which is responsible for the passive elasticity of muscle.

# tally each word: start from 0 the first time a word is seen
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

for word in frequency:
    print(word, frequency[word])

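Read the file with all words converted to lowercase (or uppercase).
Create a dictionary that uses the words in the file as keys and each word's frequency as its value.

Reference: length of the longest word in English (link)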
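
The snippet above prints every word to the console, while the question also asks for a threshold and a report in output.txt. A minimal sketch of those remaining steps could look like the following. It swaps the plain dictionary for collections.Counter, which the question already uses. The function names count_words and write_report, the script name wordcount.py in the usage message, and the ordering of the report (most frequent first) are my own assumptions, since the desired output format is not shown in the question.

```python
import sys
from collections import Counter


def count_words(lines):
    """Count words case-insensitively; blank lines are skipped."""
    return Counter(line.strip().lower() for line in lines if line.strip())


def write_report(counts, threshold, out_path="output.txt"):
    """Write 'word<TAB>count' for every word whose count is >= threshold."""
    with open(out_path, "w", encoding="utf-8") as out:
        # most_common() yields (word, count) pairs, highest count first;
        # this ordering is an assumption, not something the question specifies
        for word, count in counts.most_common():
            if count >= threshold:
                out.write("{}\t{}\n".format(word, count))


def main(argv):
    # argv[1] is the input file name, argv[2] is the integer threshold
    input_name, threshold = argv[1], int(argv[2])
    # utf-8-sig (as in the question) drops a leading byte-order mark if present
    with open(input_name, "r", encoding="utf-8-sig") as infile:
        counts = count_words(infile)
    write_report(counts, threshold)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv)
    else:
        # hypothetical script name, for illustration only
        print("usage: wordcount.py INPUT_FILE THRESHOLD", file=sys.stderr)
```

Run it as, for example, python wordcount.py number.txt 2 to keep only the words that occur at least twice. Lowercasing each line before counting is what makes "the", "The" and "THE" land on the same key.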

Answer 2

Or with pandas:

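import pandas as pd                                  # import pandas
# raw string: in a normal string the "\n" in "E:\number.txt" would be read as a newline
text1 = pd.read_csv(r"E:\number.txt", header=None)   # read the text file, one word per row
s = pd.Series(text1[0]).str.lower()                  # convert to a lowercase Series
frequency_list = s.value_counts()                    # get frequencies of unique values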
