Question

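I am trying to build a word-frequency dictionary from a text, but for some reason extra characters are being printed (I am not sure whether the problem is my text or my code), and it fails to print the lines or words that contain invalid symbols! Here is my code: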

 def parse_documentation(filename):
    filename=open(filename, "r") 
    lines = filename.read(); 
    invalidsymbols=["`","~","!", "@","#","$"]
    for line in lines: 
        for x in invalidsymbols:
            if x in line: 
                print(line) 
                print(x) 
                print(line.replace(x, "")) 
                freq={}
            for word in line:
                count=counter(word)
        freq[word]=count
    return freq

Answer 1

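Your code has several flaws. I won't fix all of them, but I will point you in the right direction.

First, read() reads the entire file as a single string. I don't think that is what you intended. Use readlines() instead to get all the lines of the file as a list.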

def parse_documentation(filename):
    with open(filename, "r") as f:
        lines = f.readlines()  # returns a list of all lines in file
    invalidsymbols = ["`", "~", "!", "@", "#", "$"]
    freq = {}  # declare this OUTSIDE of your loop.
    for line in lines:
        for letter in line:
            if letter in invalidsymbols:
                print(letter)
                line = line.replace(letter, "")
        print(line)  # this should print the line without invalid symbols.

        words = line.split()  # Now get the words.

        for word in words:  # iterate over the words, not the characters of line
            # ... Do your counter stuff here ...
            pass

    return freq
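
To make the read() versus readlines() point concrete, here is a minimal sketch. It uses io.StringIO as a stand-in for an open file (an assumption made only so the example runs without a file on disk); the sample text is made up.

```python
import io

sample = "first line\nsecond line\n"

# read() returns one string; iterating over it yields single characters.
chars = [c for c in io.StringIO(sample).read()]

# readlines() returns a list of strings; iterating over it yields whole lines.
lines = io.StringIO(sample).readlines()

print(chars[:5])  # ['f', 'i', 'r', 's', 't']
print(lines)      # ['first line\n', 'second line\n']
```

This is probably where the "extra characters" in the question come from: each "line" in the original loop is really a single character.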

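Second, I have serious doubts about how your counter method works. If you intend to count words, you could use this strategy:

Check whether word is in freq.
If it is not in freq, add it and map it to 1. Otherwise, increment the number that word was previously mapped to.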
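
The strategy above could be sketched roughly like this. The function name count_words is hypothetical and not part of the original code; it takes a list of words that has already been split.

```python
def count_words(words):
    """Count how often each word occurs, using a plain dict."""
    freq = {}
    for word in words:
        if word not in freq:
            freq[word] = 1   # first time we see this word
        else:
            freq[word] += 1  # seen before: bump the stored count
    return freq

print(count_words("this is a text text is that".split()))
# {'this': 1, 'is': 2, 'a': 1, 'text': 2, 'that': 1}
```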

This should put you on the right track.

Answer 2

Check this out; it may be what you want. By the way, your code is not valid Python. There are a lot of problems in it.

from collections import Counter

def parse_documentation(filename):
    with open(filename,"r") as fin:
        lines = fin.read()
    #for sym in ["`","~","!","@","#","$"]: lines = lines.replace(sym,'')
    # Python 3 form; in Python 2 this was lines.translate(None, "`~!@#$") (thanks to @gnibbler's comment)
    lines = lines.translate(str.maketrans("", "", "`~!@#$"))
    freq = Counter(lines.split())
    return freq

Text file:

this is a text. text is that. @this #that
$this #!that is those

Result:

Counter({'this': 3, 'is': 3, 'that': 2, 'a': 1, 'that.': 1, 'text': 1, 'text.': 1, 'those': 1})
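
One way to reproduce the counts above under Python 3 is sketched below. It assumes the sample text is written to a temporary file (the demo helper and the SAMPLE constant are hypothetical names for illustration), and it uses str.maketrans to build the deletion table.

```python
import os
import tempfile
from collections import Counter

SAMPLE = "this is a text. text is that. @this #that\n$this #!that is those\n"

def parse_documentation(filename):
    with open(filename, "r") as fin:
        text = fin.read()
    # str.maketrans("", "", chars) builds a table that deletes those characters.
    text = text.translate(str.maketrans("", "", "`~!@#$"))
    return Counter(text.split())

def demo():
    # Write the sample to a temporary file, count its words, then clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(SAMPLE)
        path = tmp.name
    try:
        return parse_documentation(path)
    finally:
        os.remove(path)

freq = demo()
print(freq["this"], freq["is"], freq["that"])  # 3 3 2
```

Note that 'text' and 'text.' are still counted separately, because '.' is not in the list of symbols being removed.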

Answer 3

You probably need line.split(' '); otherwise the for loop will iterate over the letters.

....
for word in line.split(' '):
    count=counter(word)
...
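
A small sketch of the difference, using a made-up line with a double space and a trailing newline. It also shows that split() with no argument may be the more forgiving choice, since it splits on any run of whitespace.

```python
line = "this  is a text\n"  # note the double space and the trailing newline

letters = [ch for ch in line]   # iterating a string yields characters
words_space = line.split(" ")   # splits on every single space
words_any = line.split()        # splits on any run of whitespace

print(letters[:4])  # ['t', 'h', 'i', 's']
print(words_space)  # ['this', '', 'is', 'a', 'text\n']
print(words_any)    # ['this', 'is', 'a', 'text']
```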

How do I count the words in a text and add them to a dictionary?
