Question

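I have built a function that finds how often each word occurs in a text file, but the frequency is wrong for several words because the function does not separate words from symbols, as in "happy,".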

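I have already tried using the split function to split on every "," and every ".", but that does not work. I am also not allowed to import anything into the function, because the professor does not want us to.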

The code below turns the text file into a dictionary, with each word or symbol as a key and its frequency as the value.

def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict

We are using a text file named "f". This is what the file contains.

I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.

The expected result counts both the words and the symbols.

{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}

This is what I get instead, where some words and their symbols are counted together as a single word:

{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}
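For reference, the merged tokens come from `split()` being called with no arguments: it breaks the text on whitespace only, so punctuation stays glued to the neighbouring word. A small sketch on a shortened sample (the `sample` string is made up for illustration):

```python
# str.split() with no arguments splits on whitespace only,
# so punctuation stays attached to the word before it.
sample = "I should feel happy, but I was not really happy."
tokens = sample.split()
print(tokens)

# Tokens that are not purely letters/digits still carry a symbol.
attached = [t for t in tokens if not t.isalnum()]
print(attached)
```

Because 'happy,' and 'happy.' are different dictionary keys, neither adds to the count for 'happy'.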

Answer 1

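Here is a way to generate the desired frequency dictionary for a single sentence. To process a whole file, just call this code for each line so that it updates the contents of the dictionary.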

# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}

# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')

# remove . and , 
for word in f.replace(',', '').replace('.','').split(' '):
    if word not in d.keys():
        d[word] = 1
    else: 
        d[word] += 1
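To apply this line by line, one option is to wrap the loop in a function that adds to an existing dictionary. This is only a sketch: `update_counts` and `count_file` are names invented here, and the punctuation counts are accumulated with `get` so that repeated calls add to the totals instead of overwriting them.

```python
def update_counts(line, d):
    """Add the token counts of one line to dictionary d (hypothetical helper)."""
    # count punctuation chars, accumulating across calls
    for symbol in ('.', ','):
        n = line.count(symbol)
        if n:
            d[symbol] = d.get(symbol, 0) + n
    # remove . and , then count the remaining words
    for word in line.replace(',', '').replace('.', '').split():
        d[word] = d.get(word, 0) + 1
    return d


def count_file(path):
    """Build the frequency dictionary for a whole file, line by line."""
    d = {}
    with open(path, 'r') as text:
        for line in text:
            update_counts(line, d)
    return d


# usage on in-memory lines instead of a file
d = {}
for line in ["I felt happy, I knew.", "I was happy."]:
    update_counts(line, d)
print(d)
```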

Alternatively, you can combine regular expressions with list comprehensions, as follows:

import re

# filter words and symbols
# raw strings keep \s from being read as an invalid string escape
words   = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')

# count occurrences
count_words   = dict(zip(set(words),   [words.count(w) for w in set(words)]))
count_symbols = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))

# parse results in dict
d = count_symbols.copy()
d.update(count_words)

Output:

{',': 1,
 '.': 1,
 'I': 5,
 'and': 1,
 'because': 2,
 'but': 1,
 'feel': 1,
 'felt': 1,
 'happy': 4,
 'knew': 1,
 'not': 1,
 'others': 1,
 'really': 1,
 'saw': 1,
 'should': 1,
 'the': 1,
 'was': 1,
 'were': 1}

Running both of the approaches above 1000 times in a loop and recording the run time shows that the second approach is faster than the first.
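One way to reproduce such a comparison is the standard-library `timeit` module. The sketch below wraps the two approaches in functions (`count_replace` and `count_regex` are names chosen here, not from the answer) and times 1000 runs of each. The timings depend on the machine and the input, so no numbers are shown.

```python
import re
import timeit

f = ("I felt happy because I saw the others were happy and because "
     "I knew I should feel happy, but I was not really happy.")


def count_replace(text):
    # first approach: count punctuation, strip it, then count words
    d = {'.': text.count('.'), ',': text.count(',')}
    for word in text.replace(',', '').replace('.', '').split(' '):
        d[word] = d.get(word, 0) + 1
    return d


def count_regex(text):
    # second approach: regex filtering plus list comprehensions
    words = re.sub(r'[^A-Za-z0-9\s]+', '', text).split(' ')
    symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', text).strip().split(' ')
    d = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))
    d.update(zip(set(words), [words.count(w) for w in set(words)]))
    return d


# both approaches should agree before their timings are compared
assert count_replace(f) == count_regex(f)

t_replace = timeit.timeit(lambda: count_replace(f), number=1000)
t_regex = timeit.timeit(lambda: count_regex(f), number=1000)
print(f"replace/split: {t_replace:.4f}s, regex: {t_regex:.4f}s")
```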

Answer 2

My solution is to first replace all the symbols with a space and then split on spaces. We will need a little help from regular expressions.

import re

a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'

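# replace every run of non-alphanumeric characters with a single space
b = re.sub('[^A-Za-z0-9]+', ' ', a)
print(b)
# split() with no argument drops the empty string left by the trailing space
wholetext = b.split()
print(wholetext)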

Answer 3

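My solution is similar to Verse's, but it also collects the symbols in the sentence into an array. You can then use a for loop and a dictionary to determine the counts.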

import re

a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'

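# replace symbols with spaces; raw strings keep \s from being an invalid escape
b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)
print(b)
# split() with no argument avoids empty strings from doubled spaces
wholetext = b.split()
print(wholetext)
# replace letters, digits and whitespace with spaces, leaving only the symbols
c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)
symbols = c.split()
print(symbols)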

# do the for loop stuff you did in your question but with wholetext and symbols
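The counting step that the comment refers to could look like the sketch below. It recomputes `wholetext` and `symbols` so the block runs on its own, then applies the question's loop to both lists at once; `freq` is a name picked here.

```python
import re

a = ('I felt happy because I saw the others were happy and because '
     'I knew I should feel happy, but I was not really happy.')

# words: replace symbols with spaces, then split on any whitespace
wholetext = re.sub(r'[^A-Za-z0-9\s]+', ' ', a).split()
# symbols: replace letters, digits and whitespace with spaces, then split
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', a).split()

# same counting loop as in the question, run over words and symbols together
freq = {}
for token in wholetext + symbols:
    if token in freq:
        freq[token] += 1
    else:
        freq[token] = 1
print(freq)
```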

Oh, I guess you can't import anything :(
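If imports really are off limits, the same separation can be done with plain string methods. A minimal sketch, assuming that a token is either a run of letters and digits or a single non-space symbol, and that words should be lowercased as in the expected output from the question. `get_tokens_freq_text` is a hypothetical name, and it takes a string rather than a file name.

```python
def get_tokens_freq_text(text):
    """Count words and symbols in text without importing anything."""
    freq = {}
    word = ''
    # the trailing space flushes the last word
    for ch in text.lower() + ' ':
        if ch.isalnum():
            word += ch          # still inside a word
            continue
        if word:                # a word just ended
            freq[word] = freq.get(word, 0) + 1
            word = ''
        if not ch.isspace():    # any other visible character is a symbol
            freq[ch] = freq.get(ch, 0) + 1
    return freq


f = ("I felt happy because I saw the others were happy and because "
     "I knew I should feel happy, but I was not really happy.")
print(get_tokens_freq_text(f))
```

For a file, the string returned by `text.read()` inside the question's `with open(...)` block can be passed in directly.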

How do I iterate through a dictionary to get the frequency of both words and symbols?
