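I have a text file for which I count the number of lines, the total number of characters, and the total number of words. How can I use string.replace() to clean the data by removing stop words such as (the, for, a)?
I have the following code right now.
Example: if the text file contains the following line: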
"The only words to count are Apple and Grapes for this text"
It should output:
2 Apple
2 Grapes
1 words
1 only
1 text
It should not output the stop words.
Below is my code as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Answer 0 (score: 4)
After opening and reading the file (fname = open('2013_honda_accord.txt', 'r').read()), you can add the following code:
# Blacklist of lowercase words to filter out; apply after fname.lower() so that "The" also matches
blacklist = ["the", "to", "are", "for", "this"]
for word in blacklist:
    fname = fname.replace(word, "")
# The above causes multiple spaces in the text (e.g. ' Apple  Grapes  Apple')
while "  " in fname:
    fname = fname.replace("  ", " ")  # Replace double spaces by one while double spaces are in text
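Edit: Plain replace() also removes the unwanted words when they appear inside longer words (for example, "the" inside "other"). To avoid that, you can do this (assuming the word is in the middle of a sentence):
blacklist = ["the", "to", "are", "for", "this"] # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(" " + word + " ", " ")
    # Or .'!? etc.
There is no need to check for double spaces here.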
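The space-padded version still misses a blacklisted word at the very start or end of the text, or one that sits next to punctuation. One possible way around that, offered as a sketch rather than as part of the approach above, is a regular expression with word boundaries (\b). The helper name remove_stop_words is hypothetical.

```python
import re

def remove_stop_words(text, blacklist):
    """Remove whole-word occurrences of blacklist entries (hypothetical helper)."""
    # \b matches only at word boundaries, so "the" is not removed from "other"
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in blacklist) + r")\b"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the runs of whitespace left behind by the removals
    return " ".join(cleaned.split())

blacklist = ["the", "to", "are", "for", "this"]
line = "The only words to count are Apple and Grapes for this text"
print(remove_stop_words(line, blacklist))
# prints: only words count Apple and Grapes text
```

Because re.IGNORECASE is set, this also works before the text has been lowercased.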
Hope this helps!
Answer 1 (score: 1)
You can easily drop these words by writing a simple function:
# This function drops the restricted words from a sentence.
# Input - sentence, list of restricted words (restricted list should be all lower case)
# Output - list of allowed words (unique, because of set()).
def restrict(sentence, restricted):
    return list(set([word for word in sentence.split() if word.lower() not in restricted]))
You can then use this function whenever you like (before or after the word count).
For example:
restricted = ["the", "to", "are", "and", "for", "this"]
sentence = "The only words to count are Apple and Grapes for this text"
word_list = restrict(sentence, restricted)
print(word_list)
This will print the following (the order may vary, because a set is unordered):
["count", "Apple", "text", "only", "Grapes", "words"]
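Of course, you can also filter out empty words (from double spaces), although split() with no arguments already discards them:
    return list(set([word for word in sentence.split() if word.lower() not in restricted and len(word) > 0]))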
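Note that set() in restrict collapses repeated words, so its result cannot feed a frequency count directly. Below is a minimal sketch that keeps duplicates and counts them, assuming collections.Counter from the standard library is acceptable here. The name count_words is hypothetical, and the sample text has an extra "apple" appended to show a repeated word.

```python
from collections import Counter

def count_words(text, restricted):
    """Count words in text, skipping restricted ones (hypothetical helper)."""
    stop = set(restricted)  # set membership tests are fast
    # Lowercase so that "Apple" and "apple" are counted together
    words = [w for w in text.lower().split() if w not in stop]
    return Counter(words)

restricted = ["the", "to", "are", "and", "for", "this"]
text = "The only words to count are Apple and Grapes for this text apple"
counts = count_words(text, restricted)

# most_common() returns (word, count) pairs, highest count first
for word, count in counts.most_common(3):
    print(count, word)
```

Counter replaces the manual dictionary loop from the question, and most_common(n) replaces the sort-and-reverse step.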