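I have an input text file from which I have to count the total number of characters, the number of lines, and the number of occurrences of each word.
So far I have been able to get the character, line and word counts. I also convert the text to all lowercase, so that I don't get two different counts for the same word, one lowercase and one capitalized.
Looking at the output now, I realize the word counts are not very clean. I have been struggling to output clean data that does not count any special characters and does not include periods or commas when counting words.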
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is my current code.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
# COUNT WORDS
fname = fname.lower()  # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Answer 0 (score: 0)
Try replacing
words = fname.split()
with
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
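Let me explain the various parts of the code.
Starting with the first line, whenever you see a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
you are looking at the definition of a small function whose body is a single expression: it contains no statements (so no assignment statements) and simply returns the value of that expression.
So get_alphabetical_characters is a function which, as its suggestive name implies, takes a single word and returns only the alphabetical characters contained in it.
This is done using the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).
The some_list here is produced by the list comprehension [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What it does is step through each character in the given word and put it into the list if it is alphabetical, or put an empty string in its place if it is not.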
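For example
[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]
evaluates to the following list:
['h','e','l','l','o','']
which is then passed to
"".join(['h','e','l','l','o',''])
which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Note that the empty string added at the end has no effect. Adding an empty string to any string returns the same string again. And this ultimately yields
"hello"
Hope that's clear!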
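To tie the explanation back to the example sentence from the question, here is a minimal Python 3 sketch of the same idea. The helper name count_words and the use of collections.Counter are my own choices for illustration, not part of the answer above. The sketch also assumes that a token with no letters at all (such as "*") should be dropped, since cleaning it leaves an empty string that would otherwise be counted as a word.

```python
import string
from collections import Counter


def get_alphabetical_characters(word):
    """Return only the ASCII lowercase letters of `word`, in their original order."""
    return "".join(ch for ch in word if ch in string.ascii_lowercase)


def count_words(text):
    """Count cleaned, lowercased words; tokens that contain no letters are skipped."""
    cleaned = (get_alphabetical_characters(tok) for tok in text.lower().split())
    # An empty string is falsy, so `if word` drops tokens such as "*".
    return Counter(word for word in cleaned if word)


counts = count_words("Hello, I am Bob. Hello to Bob *")
for word, n in counts.items():
    print(n, word)
```

Filtering inside the generator (`if ch in ...`) gives the same string as joining with empty-string placeholders, it just never creates the placeholders.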
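Edit #2: If you want to keep the periods that mark a decimal point, we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos] == '.' and a_string[pos-1:pos].isdigit()
# include_char needs the position as well as the string, so iterate with enumerate; whitespace is kept so the words stay separated
words = "".join(c for i, c in enumerate(fname) if c.isspace() or include_char(i, fname)).split()
What we are doing here is that the include_char function checks whether a character is "alphanumeric" (i.e. a letter or a digit), or whether it is a period whose preceding character is a digit. We use this function to drop every character we don't want from the string (whitespace is kept so that the words stay separated), join the remaining characters into a single string, and then split it into a list of strings with the str.split method.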
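The rule described here can also be written as ordinary functions, which makes it easier to try on a sample string. This is only a sketch under my own assumptions: the names keep_char and split_words are hypothetical, and whitespace is deliberately kept so that str.split still has something to split on.

```python
def keep_char(pos, text):
    """True for letters, digits, whitespace, and a '.' that directly follows a digit."""
    ch = text[pos]
    if ch.isalnum() or ch.isspace():
        return True
    # text[pos - 1:pos] is '' when pos == 0, and ''.isdigit() is False,
    # so a leading '.' is never kept.
    return ch == "." and text[pos - 1:pos].isdigit()


def split_words(text):
    """Drop unwanted characters, then split the remainder on whitespace."""
    kept = "".join(ch for i, ch in enumerate(text) if keep_char(i, text))
    return kept.split()


words = split_words("pi is 3.14, roughly. ok!")
print(words)
```

One caveat of this rule: a sentence that ends in a number keeps its final period, so "built in 2013." produces the token "2013." rather than "2013".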
Answer 1 (score: 0)
This program might help you:
# Python 2 code (it uses raw_input and the print statement).
# Characters that should not be counted as part of a word.
char2remove = (".", ",", ";", "!", "?", "*", ":")
# Read a string from the user.
string = raw_input("Enter your string: ")
# Make all the letters lower-case.
string = string.lower()
# Replace the special characters with white-space.
for char in char2remove:
    string = string.replace(char, " ")
# Extract all the words in the new string (repeats included).
words = string.split(" ")
# Count how many times each word occurs.
# (str.count would also match inside longer words, so count whole words here.)
to_count = dict()
for word in words:
    to_count[word] = to_count.get(word, 0) + 1
# Print each word with its count.
for word in to_count:
    # Empty strings left behind by consecutive spaces are not words; skip them.
    if word.isalpha():
        print word, to_count[word]
It works as follows:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
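For readers on Python 3, the same replace-then-split approach might look roughly like the sketch below. The function name count_words and the use of collections.Counter are assumptions of mine, not part of this answer. The sketch counts whole tokens rather than calling str.count on the full string, because str.count matches substrings: "a cat sat".count("a") is 3 even though the word "a" appears once.

```python
from collections import Counter

# Same characters as char2remove in the answer above.
CHARS_TO_REMOVE = (".", ",", ";", "!", "?", "*", ":")


def count_words(text):
    """Lowercase the text, blank out punctuation, and count whole words."""
    text = text.lower()
    for ch in CHARS_TO_REMOVE:
        text = text.replace(ch, " ")
    # split() with no argument discards the empty strings that
    # consecutive spaces would otherwise produce.
    return Counter(word for word in text.split() if word.isalpha())


counts = count_words("Hello, I am Bob. Hello to Bob *")
for word, n in counts.items():
    print(word, n)
```

On Python 3.7 and later the words print in order of first appearance, which differs from the order shown in the Python 2 session above.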
Another approach is to use a regular expression to remove all non-alphabetic characters (to get rid of the char2remove list):
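import re
# Python 2 code (it uses raw_input and the print statement).
# Matches any single character that is not an ASCII letter.
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
# sub() returns a new string, so the result has to be assigned back.
your_str = regex.sub(' ', your_str)
words = your_str.split(" ")
# Count whole words (str.count would also match inside longer words).
to_count = dict()
for word in words:
    to_count[word] = to_count.get(word, 0) + 1
for word in to_count:
    if word.isalpha():
        print word, to_count[word]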
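A more compact Python 3 variant of the regular-expression idea is to extract the words directly with findall instead of deleting everything else. This is a sketch, not the answer's own method: word_frequencies and ranked are hypothetical names, and the ordering (highest count first, ties alphabetical) is my reading of the output the question asks for.

```python
import re
from collections import Counter

# One or more ASCII lowercase letters; the text is lowercased before matching.
WORD_RE = re.compile(r"[a-z]+")


def word_frequencies(text):
    """Count every run of letters in the lowercased text."""
    return Counter(WORD_RE.findall(text.lower()))


def ranked(counts):
    """Sort by count (highest first), then alphabetically for ties."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


counts = word_frequencies("Hello, I am Bob. Hello to Bob *")
for word, n in ranked(counts):
    print(n, word)
```

Note that this pattern splits contractions, so "don't" is counted as "don" and "t"; whether that is acceptable depends on the text being analysed.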