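I have written a function that counts how often each word occurs in a text file, but the frequency is wrong for several words because the function does not separate words from adjacent symbols, as in "happy,".
I tried using split to separate on every "," and every ".", but that will not work here: the professor does not want us to do it that way, and I am also not allowed to import anything into the function.
The code below turns the text file into a dictionary, using each word or symbol as a key and its frequency as the value.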
def getTokensFreq(file):
    dict = {}
    with open(file, 'r') as text:
        wholetext = text.read().split()
        for word in wholetext:
            if word in dict:
                dict[word] += 1
            else:
                dict[word] = 1
    return dict
We are using a text file named "f". This is its content:
I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.
The desired result counts both words and symbols:
{'i': 5, 'felt': 1, 'happy': 4, 'because': 2, 'saw': 1,
'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1,
'feel': 1, ',': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, '.': 1}
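This is what I get; some words are counted together with the symbol attached to them as a single token:
{'I': 5, 'felt': 1, 'happy': 2, 'because': 2, 'saw': 1, 'the': 1, 'others': 1, 'were': 1, 'and': 1, 'knew': 1, 'should': 1, 'feel': 1, 'happy,': 1, 'but': 1, 'was': 1, 'not': 1, 'really': 1, 'happy.': 1}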
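Answer 0 (score: 2)
Here is a way to build the desired frequency dictionary for a single sentence. To process a whole file, run this code on each line and keep updating the same dictionary.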
# init vars
f = "I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy."
d = {}
# count punctuation chars
d['.'] = f.count('.')
d[','] = f.count(',')
# remove . and , then count the remaining words
for word in f.replace(',', '').replace('.', '').split(' '):
    if word not in d.keys():
        d[word] = 1
    else:
        d[word] += 1
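To extend this to a whole file as described above, the same idea can be applied line by line while updating one shared dictionary. The following is only a minimal sketch, not the answerer's code: the function name `count_tokens_in_lines` is hypothetical, padding the punctuation with spaces (instead of removing it and counting it separately) is an assumption made to keep the loop short, and only `,` and `.` are handled.

```python
def count_tokens_in_lines(lines):
    """Count words plus ',' and '.' across an iterable of lines (hypothetical helper)."""
    counts = {}
    for line in lines:
        # Surround each punctuation mark with spaces so split() isolates it.
        for mark in (',', '.'):
            line = line.replace(mark, ' ' + mark + ' ')
        # split() with no argument also drops newlines and repeated spaces.
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


# A file object is an iterable of lines, so a list of strings stands in for it here.
sample = [
    "I felt happy because I saw the others were happy\n",
    "and because I knew I should feel happy, but I was not really happy.\n",
]
result = count_tokens_in_lines(sample)
print(result)
```

With the file from the question, this could be called as `with open('f') as text: counts = count_tokens_in_lines(text)`. No imports are needed.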
Alternatively, you can combine regular expressions with list comprehensions, as follows:
import re
# filter words and symbols (raw strings avoid invalid-escape warnings for \s)
words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')
# count occurrences
count_words = dict(zip(set(words), [words.count(w) for w in set(words)]))
count_symbols = dict(zip(set(symbols), [symbols.count(s) for s in set(symbols)]))
# merge both results into one dict
d = count_symbols.copy()
d.update(count_words)
Output:
{',': 1,
'.': 1,
'I': 5,
'and': 1,
'because': 2,
'but': 1,
'feel': 1,
'felt': 1,
'happy': 4,
'knew': 1,
'not': 1,
'others': 1,
'really': 1,
'saw': 1,
'should': 1,
'the': 1,
'was': 1,
'were': 1}
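Running each of the two approaches above 1000 times in a loop and recording the elapsed time showed that the second approach is faster than the first.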
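The comparison can be reproduced with the standard `timeit` module. This is a sketch under assumptions: the wrapper names `count_with_replace` and `count_with_regex` are hypothetical, the bodies are lightly condensed versions of the two snippets above, and the timings depend on the machine, the Python version, and the input length, so no particular result is implied.

```python
import re
import timeit

f = ("I felt happy because I saw the others were happy and because "
     "I knew I should feel happy, but I was not really happy.")


def count_with_replace():
    # First approach: count punctuation, strip it, then count words in a loop.
    d = {'.': f.count('.'), ',': f.count(',')}
    for word in f.replace(',', '').replace('.', '').split(' '):
        d[word] = d.get(word, 0) + 1
    return d


def count_with_regex():
    # Second approach: separate words and symbols with regular expressions.
    words = re.sub(r'[^A-Za-z0-9\s]+', '', f).split(' ')
    symbols = re.sub(r'[A-Za-z0-9\s]+', ' ', f).strip().split(' ')
    d = {s: symbols.count(s) for s in set(symbols)}
    d.update({w: words.count(w) for w in set(words)})
    return d


# Run each approach 1000 times and report the total elapsed time.
t_replace = timeit.timeit(count_with_replace, number=1000)
t_regex = timeit.timeit(count_with_regex, number=1000)
print(f"replace/split: {t_replace:.4f} s, regex: {t_regex:.4f} s")
```

Both functions return the same dictionary for this sentence, so the measurement compares equivalent work.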
Answer 1 (score: 0)
My solution is to first replace every symbol with a space and then split on whitespace. We will need some help from regular expressions.
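import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
# replace every run of non-alphanumeric characters with a single space
b = re.sub(r'[^A-Za-z0-9]+', ' ', a)
print(b)
# split() without an argument avoids the empty string left by the trailing space
wholetext = b.split()
print(wholetext)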
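Answer 2 (score: 0)
My solution is similar to Verse's, but it also collects the symbols in the sentence into a list. You can then use a for loop and a dictionary to determine the counts.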
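import re
a = 'I felt happy because I saw the others were happy and because I knew I should feel happy, but I was not really happy.'
# replace symbols with spaces, keeping letters, digits and whitespace
b = re.sub(r'[^A-Za-z0-9\s]+', ' ', a)
print(b)
# split() without an argument skips the doubled spaces left where a symbol was
wholetext = b.split()
print(wholetext)
# replace everything except symbols with spaces to isolate the symbols
c = re.sub(r'[A-Za-z0-9\s]+', ' ', a)
symbols = c.strip().split(' ')
print(symbols)
# do the for loop stuff you did in your question but with wholetext and symbols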
Oh, I guess you can't import anything :(
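Since imports are ruled out, one option is to scan the text character by character using only built-in string methods. The following is a minimal sketch, not taken from any of the answers: the name `get_tokens_freq` is hypothetical, words are lowercased on the assumption that the expected output in the question (with 'i': 5) wants that, and every non-alphanumeric, non-whitespace character is counted as its own token.

```python
def get_tokens_freq(text):
    """Count lowercase words and individual symbols without any imports."""
    counts = {}
    word = ''
    for ch in text:
        if ch.isalnum():
            # Still inside a word: keep collecting characters.
            word += ch.lower()
            continue
        # A space or symbol ends the current word, if there is one.
        if word:
            counts[word] = counts.get(word, 0) + 1
            word = ''
        # Count the symbol itself, but ignore whitespace.
        if not ch.isspace():
            counts[ch] = counts.get(ch, 0) + 1
    # Flush the last word when the text does not end with a symbol or space.
    if word:
        counts[word] = counts.get(word, 0) + 1
    return counts


sentence = ("I felt happy because I saw the others were happy and because "
            "I knew I should feel happy, but I was not really happy.")
freq = get_tokens_freq(sentence)
print(freq)
```

For a file, the text could be read first with `open(file, 'r')` and `read()` as in the question and then passed to the function. Note that this simple scan also splits contractions such as "don't" into three tokens, which may or may not be what the assignment expects.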