How do I loop through a dictionary to get the frequency of both words and symbols?

Time: 2019-04-10 22:49:16

Tags: python dictionary

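I have built a function that finds the frequency of words in a text file, but the frequency is wrong for several words because the function does not separate words from attached symbols, as in "happy,".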

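I have already tried using the split function to split on every "," and every ".", but that does not work. I am also not allowed to import anything into the function, because the professor does not want us to.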

The code below turns the text file into a dictionary, using each word or symbol as a key and its frequency as the value.

def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict

We are using a text file named "f". This is the content of the file:

I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.

The expected result counts both words and symbols:

{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}

This is what I get, where some words and symbols are counted together as a single word:

  

{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}

3 answers:

Answer 0 (score: 2)

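Here is a way to generate the desired frequency dictionary for a single sentence. To process a whole file, call this code for each line so that it keeps updating the same dictionary.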

# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}

# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')

# remove "." and ",", then split on whitespace
for word in f.replace(',', '').replace('.', '').split():
    if word not in d:
        d[word] = 1
    else:
        d[word] += 1
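
To make the per-line idea above concrete, here is a minimal sketch that needs no imports. The function name `count_tokens`, the `symbols` parameter, and the lowercasing (suggested by the lowercase keys in the question's expected output) are assumptions for illustration, not part of the original answer. Instead of counting and then deleting punctuation, it pads each symbol with spaces so that a plain `split()` separates it from the neighbouring word.

```python
def count_tokens(lines, symbols=",."):
    """Count words and the given symbols across an iterable of text lines."""
    counts = {}
    for line in lines:
        # Lowercasing is an assumption based on the question's expected output.
        line = line.lower()
        # Surround each symbol with spaces so split() yields it as its own token.
        for symbol in symbols:
            line = line.replace(symbol, " " + symbol + " ")
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


sentence = ("I felt happy because I saw the others were happy and because "
            "I knew I should feel happy, but I was not really happy.")
result = count_tokens([sentence])
print(result["happy"], result[","], result["."])  # 4 1 1
```

Because iterating over an open file yields its lines, a file object can be passed in directly, for example `count_tokens(text)` inside a `with open(...) as text:` block.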

Alternatively, you can combine regular expressions with list comprehensions, as follows:

import re

# filter words and symbols (f is the sentence defined in the first snippet);
# raw strings keep "\s" from being read as a string escape
words   = re.sub(r'[^A-Za-z0-9\s]+', '', f).split()
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).split()

# count occurrences of each distinct word and symbol
count_words   = {w: words.count(w) for w in set(words)}
count_symbols = {s: symbols.count(s) for s in set(symbols)}

# parse results in dict
d = count_symbols.copy()
d.update(count_words)

Output:

{',': 1,
 '.': 1,
 'I': 5,
 'and': 1,
 'because': 2,
 'but': 1,
 'feel': 1,
 'felt': 1,
 'happy': 4,
 'knew': 1,
 'not': 1,
 'others': 1,
 'really': 1,
 'saw': 1,
 'should': 1,
 'the': 1,
 'was': 1,
 'were': 1}

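Running each of the two approaches above 1000 times in a loop and recording the run time showed the second approach to be faster than the first.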
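
One way to reproduce such a comparison is the standard-library `timeit` module. The sketch below is an assumption about how the measurement could be set up, not the code that was actually used for the claim above. The wrapper names `count_with_replace` and `count_with_regex` are made up for illustration, and the timings will vary with the machine and the input.

```python
import re
import timeit

TEXT = ("I felt happy because I saw the others were happy and because "
        "I knew I should feel happy, but I was not really happy.")


def count_with_replace(text):
    """First approach: count '.' and ',' directly, then strip them and split."""
    counts = {".": text.count("."), ",": text.count(",")}
    for word in text.replace(",", "").replace(".", "").split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_regex(text):
    """Second approach: separate words and symbols with regular expressions."""
    words = re.sub(r"[^A-Za-z0-9\s]+", "", text).split()
    symbols = re.sub(r"[A-Za-z0-9\s]+", " ", text).split()
    counts = {s: symbols.count(s) for s in set(symbols)}
    counts.update({w: words.count(w) for w in set(words)})
    return counts


# Both approaches should agree before their speed is compared.
assert count_with_replace(TEXT) == count_with_regex(TEXT)

for func in (count_with_replace, count_with_regex):
    seconds = timeit.timeit(lambda: func(TEXT), number=1000)
    print(f"{func.__name__}: {seconds:.4f} s for 1000 runs")
```

No expected timings are shown here on purpose, since they depend on the environment.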

Answer 1 (score: 0)

My solution is to first replace every symbol with a space and then split on whitespace. We will need some help from regular expressions.

import re

a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'

b =  re.sub('[^A-Za-z0-9]+', ' ', a)
print(b)
wholetext = b.split()  # split() with no argument avoids a trailing empty string
print(wholetext)
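
Replacing symbols with spaces discards them, so they can no longer be counted. As a possible extension (not part of the original answer), `re.findall` with an alternation can return words and symbols as separate tokens in one pass, and a plain dictionary loop can then count them. The pattern below treats any run of letters or digits as a word and any other single non-space character as a symbol; that definition is an assumption.

```python
import re

a = ('I felt happy because I saw the others were happy and because '
     'I knew I should feel happy, but I was not really happy.')

# A token is either a run of letters/digits or one non-alphanumeric, non-space character.
tokens = re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9\s]', a)

freq = {}
for token in tokens:
    freq[token] = freq.get(token, 0) + 1

print(freq['happy'], freq[','], freq['.'])  # 4 1 1
```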

Answer 2 (score: 0)

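My solution is similar to Verse's, but it also collects the symbols in the sentence into an array. You can then use a for loop and a dictionary to determine the counts.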

import re

a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'

b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)  # symbols replaced by spaces
print(b)
wholetext = b.split()  # split() collapses repeated spaces
print(wholetext)
c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)  # words replaced by spaces
symbols = c.split()
print(symbols)

# do the for loop stuff you did in your question but with wholetext and symbols

Oh, I guess you can't import anything :(
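
Since the question rules out imports, the same separation of words and symbols can be done by scanning the text one character at a time. This is a minimal sketch rather than a definitive solution: the function name `get_tokens_freq`, the use of `str.isalnum` to decide what belongs to a word, and the lowercasing (suggested by the question's expected output) are all assumptions.

```python
def get_tokens_freq(text):
    """Count words and single-character symbols in text without any imports."""
    freq = {}
    word = ""
    # The trailing space forces the final word, if any, to be flushed.
    for ch in text.lower() + " ":
        if ch.isalnum():
            word += ch  # still inside a word
            continue
        if word:
            freq[word] = freq.get(word, 0) + 1
            word = ""
        if not ch.isspace():
            freq[ch] = freq.get(ch, 0) + 1  # any other character is a symbol
    return freq


a = ('I felt happy because I saw the others were happy and because '
     'I knew I should feel happy, but I was not really happy.')
print(get_tokens_freq(a))
```

For a file, the string returned by `read()` on the opened file can be passed in as `text`.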