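I have a text file for which I count the number of lines, the total number of characters, and the total number of words. How can I clean the data by removing stop words such as (the, for, a) using string.replace()?
Example: for my text file, the output should be:
1 Buttons
1 Shares
1 words
1 only
1 text
But my code does more than drop the stop words I blacklisted: it also removes them when they occur inside other words. This is what my code actually outputs:
1 Butns (this is a problem)
1 Shs (this is a problem)
1 words
1 only
1 text
Here is my current code: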
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
# Remove Stop words
blacklist = ["the", "to", "are", "and", "for", "this" ] # Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(word, "")
# Removing special characters from the word count
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz1234567890-' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:
        d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
        d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
Thanks
Answer 0 (score: 1)
Assuming your analysis does not need punctuation, you can do the following:
punctuation_list = ['?',',','.'] # non exhaustive
for punctuation in punctuation_list:
    fname = fname.replace(punctuation, "")
blacklist = ["the", "to", "are", "and", "for", "this" ]
for word in blacklist:
    fname = fname.replace(" "+word+" ", " ") # replace a stop word that has a space on both sides with a single space
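One caveat with the space-padded replace above: it misses a stop word at the very start or end of the text, or one next to a line break, because there is no space on that side. A minimal sketch of an alternative, assuming a regular expression is acceptable in place of string.replace(); the helper name remove_stop_words and the sample sentence are made up for illustration and are not from the question:

```python
import re

def remove_stop_words(text, blacklist):
    # \b matches only at word boundaries, so "to" inside "buttons"
    # and "are" inside "shares" are left untouched.
    pattern = r"\b(?:" + "|".join(map(re.escape, blacklist)) + r")\b"
    return re.sub(pattern, "", text)

blacklist = ["the", "to", "are", "and", "for", "this"]
# Hypothetical sample text, already lower-cased and without punctuation
sample = "the only words to count are buttons and shares for this text"
cleaned = remove_stop_words(sample, blacklist)
print(cleaned.split())
```

Note that \b treats a hyphen as a boundary, so a stop word inside a hyphenated token such as "go-to" would still be removed; whether that matters depends on the data.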
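Answer 1 (score: 0)
You are removing "to" and "are" wherever they occur, including inside other words, because replace() works on substrings: "buttons" contains "to" and "shares" contains "are".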
# Remove Stop words
blacklist = ["the", "to", "are", "and", "for", "this" ]
# Blacklist of words to be filtered out
for word in blacklist:
    fname = fname.replace(word, "")
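To see the effect in isolation, here is a small sketch that runs the same replace loop on two words from the question. The helper name strip_substrings is only for illustration:

```python
blacklist = ["the", "to", "are", "and", "for", "this"]

def strip_substrings(word, blacklist):
    # Same loop as in the question: str.replace() removes every
    # occurrence of the substring, even in the middle of a word.
    for stop in blacklist:
        word = word.replace(stop, "")
    return word

print(strip_substrings("buttons", blacklist))  # butns
print(strip_substrings("shares", blacklist))   # shs
```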
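Answer 2 (score: 0)
Move the stop-word filtering to the point where you build d, the dictionary that maps words to counts. Adding the line "if w not in blacklist:" there, to skip words contained in the blacklist, removes the stop words without altering other words.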
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
# Removing special characters from the word count
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz1234567890-' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
# Remove Stop words
blacklist = ["the", "to", "are", "and", "for", "this" ] # Blacklist of words to be filtered out
d = {}
for w in words:
    # Do not count words in the blacklist
    if w not in blacklist:
        # if the word is repeated - start count
        if w in d:
            d[w] += 1
        # if the word is only used once then give it a count of 1
        else:
            d[w] = 1
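The same idea can be sketched more compactly with collections.Counter from the standard library. This is an alternative formulation, not the code from the answer; the function name count_words and the sample text are assumptions for illustration:

```python
from collections import Counter

ALLOWED = set("abcdefghijklmnopqrstuvwxyz1234567890-")
BLACKLIST = {"the", "to", "are", "and", "for", "this"}  # a set gives fast lookups

def count_words(text):
    counts = Counter()
    for raw in text.lower().split():
        # Keep only letters, digits and hyphens, as in the question
        word = "".join(ch for ch in raw if ch in ALLOWED)
        # Unlike the code above, this also skips tokens that end up empty
        if word and word not in BLACKLIST:
            counts[word] += 1
    return counts

# Hypothetical sample text
sample = "The only words to count are Buttons and Shares for this text."
counts = count_words(sample)
for word, count in counts.most_common():
    print(count, word)
```

Because whole words are compared against the blacklist, "buttons" and "shares" are counted intact.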