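I'm trying to count how many times each word appears in a txt file. The program seems to work, but I can't stop it from counting what I think is whitespace (the 60 in my results, which makes no sense because there are far more than 60 spaces). Also, is there a way to strip - and -- from the middle of words?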
import string

words = {}

def unique_words2(filename):
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    for line in open(filename):
        for word in line.lower().split():
            if word == " ":
                continue
            else:
                word = word.strip(strip)
                words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))

unique_words2("alice.txt")
The first 5 results are:
60
a 627
a--i'm 1
a-piece 1
abide 1
I would like to eliminate results like 1, 3 and 4.
Answer 0 (score: 0)
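Python's strip method only removes the specified characters from the beginning and end of a string, which is why results 3 and 4 keep their hyphens. Using the translate method fixes that. Result 1 has a different cause: a word made up entirely of characters in strip is reduced to an empty string, which is then counted in the words dictionary under the empty key.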
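The difference between the two methods can be checked on a few tokens taken from the question's output. This is only a small sketch; the names chars and table are chosen here for illustration, and the character set mirrors the question's strip variable.

```python
import string

# Same character set as the question's `strip` variable.
chars = string.whitespace + string.punctuation + string.digits

# str.strip only trims the ends, so an interior hyphen survives.
print("a-piece,".strip(chars))      # a-piece
# A token made only of those characters strips down to the empty string.
print(repr("--".strip(chars)))      # ''

# str.translate with a {code point: None} table deletes the characters everywhere.
table = {ord(c): None for c in chars}
print("a-piece,".translate(table))  # apiece
print("a--i'm".translate(table))    # aim
```

Note that deleting the hyphens joins the two halves (a--i'm becomes aim), so whether this is the desired count depends on how such tokens should be treated.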
Adjusted code:
import string

def unique_words2(filename):
    words = {}
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    # Map every unwanted character to None so translate() deletes it anywhere in the word.
    translation = {ord(bad): None for bad in strip}
    with open(filename) as f:
        for line in f:
            for word in line.lower().split():
                word = word.translate(translation)
                if word:  # skip words that consisted only of removed characters
                    words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))

unique_words2("alice.txt")
Answer 1 (score: 0)
From https://docs.python.org/2/library/string.html:
string.split(s[, sep[, maxsplit]])
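The words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
Replacing any other separator (such as '-') with a space should do the trick. There is no need to handle repeated spaces, because a run of them is treated as a single separator.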
import string

words = {}

def unique_words2(filename):
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    separators = '-_|'
    with open(filename) as f:
        for line in f:
            # Replace in-word separators with spaces so split() breaks on them.
            for sep in separators:
                line = line.replace(sep, ' ')
            for word in line.lower().split():
                word = word.strip(strip)
                if word:  # skip words that consisted only of stripped characters
                    words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))
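The separator-replacement approach can also be tried without a file by passing the lines in directly. The following is a minimal sketch under that assumption; count_words and the sample lines are hypothetical and not part of either answer, and collections.Counter is swapped in for the plain dictionary.

```python
import string
from collections import Counter

def count_words(lines, separators="-_|"):
    """Count words in an iterable of lines (hypothetical helper for illustration)."""
    strip_chars = string.whitespace + string.punctuation + string.digits
    counts = Counter()
    for line in lines:
        # Turn in-word separators into spaces so split() breaks on them.
        for sep in separators:
            line = line.replace(sep, " ")
        for word in line.lower().split():
            word = word.strip(strip_chars)
            if word:  # skip tokens that were only punctuation or digits
                counts[word] += 1
    return counts

# Made-up sample lines reusing tokens from the question's output.
sample = ["A--I'm a-piece -- 42", "Abide a"]
for word, n in sorted(count_words(sample).items()):
    print(word, n)
```

Unlike the translate version, this keeps the apostrophe inside i'm and counts the two halves of a-piece as separate words, which may or may not be what you want.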