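I want to count how many times each word appears in a text file, and I'm not sure what is wrong. I'm also having trouble finding a way to make the count ignore whether a word is capitalized.
The input file contains only one word per line, with no whitespace before or after the word. The script does not need to validate the contents of the input file.
The letter case of the words in the input file does not matter for counting. For example, the script should treat "the", "The" and "THE" as the same word.
After counting the words, the script prints a report (to a file, output.txt) that lists the words and their counts. Each word is printed only if its count is greater than or equal to a threshold given on the command line.
Here is my code: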
file = open(r"E:\number.txt", "r", encoding="utf-8-sig")
from collections import Counter
word_counter = Counter(file.read().split())
for item in word_counter.items():
    print("{}\t{}".format(*item))
file.close()
But I want the output in the following format:
Answer 0: (score: 0)
import re

frequency = {}
# assuming the words are stored in s1.txt
with open('s1.txt', 'r') as file1:
    text1 = file1.read().lower()  # lowercase so "the", "The" and "THE" match
match_pattern = re.findall(r'[a-z]{1,189819}', text1)
# The longest word in English has 189,819 letters and would take you three and a half hours
# to pronounce correctly. Seriously. It's the chemical name of Titin (or connectin), a giant protein
# that functions as a molecular spring which is responsible for the passive elasticity of muscle.
for word in match_pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
for word in frequency:
    print(word, frequency[word])
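Read the file with all words converted to lowercase (or uppercase).
Create a dictionary with the words from the file as keys and each word's frequency as its value.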
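The two steps above can also be sketched end to end, including the threshold and the output.txt report the question asks for. This is only a minimal sketch under assumptions: the function names `count_words` and `write_report`, the script name in the usage comment, and the argument order (input path, then threshold) are hypothetical and not taken from the question.

```python
import sys
from collections import Counter


def count_words(lines):
    """Count words case-insensitively; one word per line is assumed."""
    return Counter(line.strip().lower() for line in lines if line.strip())


def write_report(counts, threshold, out_path="output.txt"):
    """Write 'word<TAB>count' for every word whose count is >= threshold."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, n in sorted(counts.items()):
            if n >= threshold:
                out.write(f"{word}\t{n}\n")


# Hypothetical usage: python count_words.py input.txt 2
# Runs only when both arguments (input path and threshold) are given.
if __name__ == "__main__" and len(sys.argv) == 3:
    in_path, threshold = sys.argv[1], int(sys.argv[2])
    with open(in_path, encoding="utf-8-sig") as f:
        write_report(count_words(f), threshold)
```

Lowercasing each line before it reaches the `Counter` is what merges "the", "The" and "THE" into one key. The threshold check is left to the report step, so the full counts stay available.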
Answer 1: (score: 0)
Or with pandas:
import pandas as pd  # import pandas
# Raw string so that "\n" in the path is not read as a newline
text1 = pd.read_csv(r"E:\number.txt", header=None)  # read the text file
s = text1[0].str.lower()  # convert to a lowercase Series
frequency_list = s.value_counts()  # get frequencies of unique values
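The threshold from the question can then be applied to the result of `value_counts()` with a boolean mask. A minimal sketch, assuming pandas is installed; the sample words and the `threshold` value are made up for illustration and stand in for the real input file and command-line argument.

```python
import pandas as pd

# Hypothetical sample data standing in for the one-word-per-line file.
words = pd.Series(["the", "The", "THE", "cat", "Cat", "dog"])
threshold = 2  # assumed value; the question takes it from the command line

counts = words.str.lower().value_counts()  # frequency of each lowercased word
report = counts[counts >= threshold]       # keep words meeting the threshold

# Write "word<TAB>count" lines; header=False omits the column header row.
report.to_csv("output.txt", sep="\t", header=False)
```

`value_counts()` sorts by descending count, so the report lists the most frequent words first.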