Need help cleaning up the words counted from a text file

Time: 2016-04-23 07:15:14

Tags: python python-3.x

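I have an input text file from which I have to count the total number of characters, the number of lines, and the number of occurrences of each word.

So far I have been able to get the character, line, and word counts. I also convert the text to all lowercase so that I don't get two different counts for the same word, one for the lowercase form and one for the uppercase form.

Now, looking at the output, I realize that the word counts are not very clean. I have been struggling to produce clean output that does not count special characters and does not include periods or commas when counting words.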

Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"

it should output:
2 Hello
2 Bob
1 I
1 am
1 to

Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *

Below is the code I have right now.

# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()

# COUNT CHARACTERS
num_chars = len(fname)

# COUNT LINES 
num_lines = fname.count('\n')

#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:    
       d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
       d[w] = 1

# Add the sum of all the repeated words 
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count 
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')

# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s.  %4s %s' % (i, count, word))
    i += 1

Thanks

2 answers:

Answer 0: (score: 0)

Try replacing

words = fname.split()

with

get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))

Let me explain the parts of this code.

Starting with the first line, whenever you see a declaration of the form

function_name = lambda argument1, argument2, ..., argumentN: some_python_expression

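what you are looking at is the definition of a function whose body is a single expression. It cannot contain statements such as assignments, so it is meant to compute and return a value rather than to change variables.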
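As a small illustration of the lambda form described above, the sketch below defines the same function twice, once with lambda and once with def. The names square and square_def are hypothetical and are not part of the original answer.

```python
# Hypothetical example: a lambda and an equivalent def.
# Both take one argument and return the value of a single expression.
square = lambda x: x * x


def square_def(x):
    return x * x


print(square(4), square_def(4))
```

The two forms behave the same when called; def is usually preferred when the function needs a name, a docstring, or more than one expression.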

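So get_alphabetical_characters is a function and, as its name suggests, it takes a single word and returns only the alphabetical characters it contains.

This is done with the "".join(some_list) idiom, which takes a list of strings and concatenates them (in other words, it produces a single string by joining them together in the given order).

Here some_list is provided by the list comprehension

[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]

What this does is step through each character in the given word and put it into the list if it is alphabetical, or put an empty string in its place if it is not.

For example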

[char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in "hello."]

evaluates to the following list:

['h','e','l','l','o','']

which is then passed to
"".join(['h','e','l','l','o',''])

which is equivalent to

'h'+'e'+'l'+'l'+'o'+''

Note that the empty string added at the end has no effect. Adding an empty string to any string gives back the same string. So this finally produces

"hello"

Hope that's clear!
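Putting the pieces above together, here is a minimal sketch of the whole counting step, assuming the text has already been read into a string. The helper names clean_word and count_clean_words are hypothetical, and collections.Counter is swapped in for the hand-written dictionary loop from the question.

```python
from collections import Counter


def clean_word(word):
    """Keep only the letters a-z from an already lowercased word."""
    return "".join(ch for ch in word if ch in "abcdefghijklmnopqrstuvwxyz")


def count_clean_words(text):
    """Lowercase the text, strip non-letters from each word, and count."""
    cleaned = (clean_word(w) for w in text.lower().split())
    # Tokens such as "*" clean down to an empty string, so skip those.
    return Counter(w for w in cleaned if w)


counts = count_clean_words("Hello, I am Bob. Hello to Bob *")
for word, count in counts.most_common():
    print(count, word)
```

Filtering out empty strings matters here: without it, a token made up only of punctuation would be counted as an empty "word".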

Edit #2: If you want to keep periods that are used as decimal points, we can write a function like this:

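include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos].isspace() or (a_string[pos] == '.' and a_string[pos-1:pos].isdigit())
words = "".join(char for pos, char in enumerate(fname) if include_char(pos, fname)).split()

What we are doing here is that the include_char function checks whether a character is "alphanumeric" (that is, a letter or a digit), is whitespace, or is a period whose preceding character is a digit. We use this function to drop every other character from the string, join what is left into a single string, and then split it into a list of strings with the str.split method. Whitespace has to be kept so that split still has word boundaries to work with.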
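The decimal-point rule described above can be tried on a small sample. This is only a sketch under the assumption that whitespace is kept as the word separator. The names keep_char and split_words are hypothetical.

```python
def keep_char(pos, text):
    """Return True if text[pos] should survive the clean-up."""
    ch = text[pos]
    if ch.isalnum() or ch.isspace():
        return True
    # Keep a period only when the character just before it is a digit.
    # text[pos - 1:pos] is an empty string at pos 0, so isdigit() is False.
    return ch == "." and text[pos - 1:pos].isdigit()


def split_words(text):
    """Drop unwanted characters, then split on whitespace."""
    kept = "".join(ch for pos, ch in enumerate(text) if keep_char(pos, text))
    return kept.split()


print(split_words("it costs 3.5 dollars. wow!"))
```

One limitation of this rule: a sentence-ending period that directly follows a digit, as in "in 2013.", is kept as well, because the check only looks at the preceding character.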

Answer 1: (score: 0)

This program might help you:

# Characters that should not be treated as part of a word.
char2remove = (".", ",", ";", "!", "?", "*", ":")
# Read a string from the user (Python 2; use input() in Python 3).
string = raw_input("Enter your string: ")
# Make all the letters lower-case.
string = string.lower()
# Replace the special characters with white-space.
for char in char2remove:
    string = string.replace(char, " ")
# Extract all the words in the new string (repeats included).
words = string.split(" ")
# Build a dictionary whose keys are the distinct words.
to_count = dict()
for word in words:
    to_count[word] = 0
# Count how often each word occurs.
for word in to_count:
    # Splitting on " " leaves empty strings behind; isalpha() skips them.
    if word.isalpha():
        # Count whole words; string.count(word) would also match substrings.
        print word, words.count(word)

It works as follows:

>>> ================================ RESTART ================================
>>> 
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>> 
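Since the question is tagged python-3.x, here is a hedged Python 3 sketch of the same punctuation-replacement idea. It takes the text as a function argument instead of prompting for input, and it counts whole words rather than substrings. The names CHARS_TO_REMOVE and count_words_py3 are hypothetical.

```python
# Characters that should be treated as separators, not as parts of words.
CHARS_TO_REMOVE = (".", ",", ";", "!", "?", "*", ":")


def count_words_py3(text):
    """Return a dict mapping each alphabetic word to its count."""
    text = text.lower()
    for char in CHARS_TO_REMOVE:
        text = text.replace(char, " ")
    counts = {}
    # split() with no argument collapses runs of whitespace,
    # so no empty strings are produced.
    for word in text.split():
        if word.isalpha():
            counts[word] = counts.get(word, 0) + 1
    return counts


word_totals = count_words_py3("Hello, I am Bob. Hello to Bob *")
for word, count in word_totals.items():
    print(word, count)
```

Counting over the list of words avoids a pitfall of str.count on the full string: "i am amber".count("am") is 2, because "am" also occurs inside "amber".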

Another approach is to use a regular expression to remove all non-alphabetic characters (which gets rid of the char2remove list):

import re
regex = re.compile('[^a-zA-Z]')

your_str = raw_input("Enter String: ")
your_str = your_str.lower()
# sub() returns a new string, so the result must be assigned back.
your_str = regex.sub(' ', your_str)

words = your_str.split(" ")
to_count = dict()
for word in words:
    to_count[word]=0

for word in to_count:
    if word.isalpha():
        # Count whole words rather than substrings of the input.
        print word, words.count(word)
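For Python 3, the regular-expression idea can be sketched more compactly by extracting the words directly with re.findall and counting them with collections.Counter. This is an alternative formulation, not the answer's original code. The names WORD_RE and regex_word_counts are hypothetical.

```python
import re
from collections import Counter

# Match runs of lowercase ASCII letters; everything else acts as a separator.
WORD_RE = re.compile(r"[a-z]+")


def regex_word_counts(text):
    """Lowercase the text, pull out letter-only words, and count them."""
    return Counter(WORD_RE.findall(text.lower()))


regex_counts = regex_word_counts("Hello, I am Bob. Hello to Bob *")
for word, count in regex_counts.most_common(30):
    print(count, word)
```

Note that this pattern splits contractions, so "don't" becomes "don" and "t". Whether that is acceptable depends on the input file.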