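I'm trying to build a word-frequency dictionary from a text, but for some reason extra characters get printed (I'm not sure whether the problem is my text or my code), and it fails to print the lines or words that contain invalid symbols! Here is my code: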
def parse_documentation(filename):
    filename=open(filename, "r")
    lines = filename.read();
    invalidsymbols=["`","~","!", "@","#","$"]
    for line in lines:
        for x in invalidsymbols:
            if x in line:
                print(line)
                print(x)
                print(line.replace(x, ""))
        freq={}
        for word in line:
            count=counter(word)
            freq[word]=count
    return freq
Answer 0 (score: 2)
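Your code has several flaws. I won't address all of them, but I will point you in the right direction.
First, read() reads the entire file as a single string. I don't think that is what you intended. Use readlines() instead, which returns all the lines in the file as a list.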
def parse_documentation(filename):
    filename=open(filename, "r")
    lines = filename.readlines() # returns a list of all lines in file
    invalidsymbols=["`","~","!", "@","#","$"]
    freq = {} # declare this OUTSIDE of your loop.
    for line in lines:
        for letter in line:
            if letter in invalidsymbols:
                print(letter)
                line = line.replace(letter, "")
        print(line) # this should print the line without invalid symbols.
        words = line.split() # Now get the words.
        for word in words: # iterate over the words, not the characters of line
            count=counter(word) # counter() is your own function, not shown in the question
            # ... Do your counter stuff here ...
    return freq
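To illustrate the first point, here is a minimal sketch of the difference between read() and readlines(). The helper name read_both_ways and the two-line sample file are made up for this illustration and are not part of the question.

```python
import os
import tempfile


def read_both_ways(path):
    """Return (whole_text, line_list) for the file at path."""
    with open(path, "r") as f:
        whole_text = f.read()        # one single string
    with open(path, "r") as f:
        line_list = f.readlines()    # list of lines, newlines included
    return whole_text, line_list


# Hypothetical two-line sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("this is\n@that\n")
    sample_path = tmp.name

whole_text, line_list = read_both_ways(sample_path)
os.remove(sample_path)

# Looping over the string from read() yields single characters...
print(list(whole_text)[:4])  # ['t', 'h', 'i', 's']
# ...while looping over the list from readlines() yields whole lines.
print(line_list)             # ['this is\n', '@that\n']
```

This is probably why the original loop printed one character at a time: `for line in lines` was walking over the characters of a single string.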
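Second, I very much doubt that your counter method works the way you expect. If you intend to count how often each word occurs, you can use this strategy: check whether word is in freq. If it is not in freq, add it and map it to 1. Otherwise, increment the number that word was previously mapped to. That should put you on the right track.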
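The strategy just described could be sketched roughly as follows. The function name count_words and the sample sentence are assumptions made for illustration; freq and word match the names used in the answer.

```python
def count_words(words):
    """Count occurrences of each word using a plain dict."""
    freq = {}
    for word in words:
        if word not in freq:
            freq[word] = 1       # first time we see this word
        else:
            freq[word] += 1      # seen before: increment its count
    return freq


print(count_words("this is a text text is that".split()))
# {'this': 1, 'is': 2, 'a': 1, 'text': 2, 'that': 1}
```

A shorter equivalent is `freq[word] = freq.get(word, 0) + 1`, which folds both branches into one line.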
Answer 1 (score: 1)
Check this out; it may be what you want. By the way, your code is not valid Python code. There are a lot of problems in it.
from collections import Counter

def parse_documentation(filename):
    with open(filename,"r") as fin:
        lines = fin.read()
    #for sym in ["`","~","!","@","#","$"]: lines = lines.replace(sym,'')
    # Python 3 form of lines.translate(None,"`~!@#$"); thanks to @gnibbler's comment
    lines = lines.translate(str.maketrans("", "", "`~!@#$"))
    freq = Counter(lines.split())
    return freq
Text file:
this is a text. text is that. @this #that
$this #!that is those
Result:
Counter({'this': 3, 'is': 3, 'that': 2, 'a': 1, 'that.': 1, 'text': 1, 'text.': 1, 'those': 1})
Answer 2 (score: 0)
You probably need line.split(' '). Otherwise the for loop will iterate over the individual letters.
....
for word in line.split(' '):
    count=counter(word)
...
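As a small sketch of the point above, compare iterating over a string directly with iterating over the result of split. The sample line is made up for illustration and contains a double space on purpose.

```python
line = "this is  a text"  # hypothetical sample; note the double space

# Iterating over the string itself yields characters, not words.
chars = [ch for ch in line]

# split(' ') cuts at every single space, so a double space
# produces an empty string in the result.
by_single_space = line.split(' ')

# split() with no argument cuts at any run of whitespace.
by_whitespace = line.split()

print(chars[:4])         # ['t', 'h', 'i', 's']
print(by_single_space)   # ['this', 'is', '', 'a', 'text']
print(by_whitespace)     # ['this', 'is', 'a', 'text']
```

If the text may contain repeated spaces, tabs, or trailing newlines, split() without an argument is usually the safer choice, since it avoids counting empty strings as words.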