I'm new to Python. I have a txt file that looks like this:
Doc 1
aaa bbb ccc ddd ...
Doc 2
eee fff ggg hhh ...
Doc 3
aaa ggg iii kkk ...
...
Doc 11
eee ttt uuu zzz ...
Basically, what I want to do is count the term frequencies for each document and put them into 11 different dictionaries (e.g. for Doc 1, {'aaa': 10, 'bbb': 5, ...}), and finally build a term-document matrix. My current code is as follows:
# split the text file into 11 documents (paragraphs)
f = open('filename.txt', 'r')
data = f.read()
docs = data.split("\n\n")
# create 11 tf dictionaries
dictstr = 'tf'
dictlist = [dictstr + str(i) for i in range(10)]
for i in range(10):
    for line in docs[i]:
        tokens = line.split()
        for term in tokens:
            term = term.lower()
            term = term.replace(',', '')
            term = term.replace('"', '')
            term = term.replace('.', '')
            term = term.replace('/', '')
            term = term.replace('(', '')
            term = term.replace(')', '')
            if not term in dict['tfi']:
                dict['tfi'][term] = 1
            else:
                dict['tfi'][term] += 1
I'm stuck at the last "if - else" step. Can anyone tell me how to handle it? (Please don't use other packages like pandas.) Thanks! The txt resource is here.
Answer 0 (score: 0)
This code reads in the file you provided, removes the unwanted characters in a single pass (versus creating a new string for each use of .replace), and stores the word counts in a dictionary named result. The keys are the documents ('XXX9' -> 'tf9') and the values are collections.Counter objects holding the word counts.
>>> import re
... from collections import Counter
...
... with open('filename.txt', 'r') as f:
...     data = f.read().lower()
...
... clean_data = re.sub(r'[,"./()]', '', data)
...
... result = {}
... for line in clean_data.splitlines():
...     if not line:
...         continue  # skip blank lines
...     elif line.startswith('xxx'):
...         doc_num = 'tf{}'.format(line[3:])
...     else:
...         result[doc_num] = Counter(line.split())
...
>>> list(result.keys())
['tf7', 'tf10', 'tf5', 'tf2', 'tf9', 'tf4', 'tf11', 'tf3', 'tf6', 'tf8', 'tf1']
>>> for k, v in list(result['tf1'].items())[:15]:
...     print("'{}': {}".format(k, v))
...
'class': 1
'then': 1
'emerge': 1
'industry': 1
'common': 1
'ourselves': 2
'models': 1
'short': 1
'mgi': 1
'it': 1
'actionable': 1
'time': 1
'why': 1
'theory': 1
'equip': 2
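
If you would rather keep your original loop structure with plain dictionaries instead of Counter, the "if - else" step you were stuck on is usually written with dict.get, which falls back to 0 for terms that haven't been seen yet. This is only a sketch that follows your split-on-blank-lines approach; the names tf_dicts and tf are mine, not from your code:

import re

with open('filename.txt', 'r') as f:
    docs = f.read().lower().split('\n\n')  # one string per document, as in your code

tf_dicts = []  # a list of dicts instead of separately named tf0, tf1, ... variables
for doc in docs:
    tf = {}
    for term in re.sub(r'[,"./()]', '', doc).split():
        tf[term] = tf.get(term, 0) + 1  # increment the count, starting at 0 for new terms
    tf_dicts.append(tf)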
Let me know if any changes are needed to help answer your question!
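
Since the end goal mentioned in the question is a term-document matrix, here is a minimal sketch of how one could build it from the result dictionary above using only the standard library. It assumes the keys follow the 'tf1' ... 'tf11' pattern shown in the output; doc_names, vocabulary and td_matrix are names made up for this example:

doc_names = sorted(result, key=lambda k: int(k[2:]))  # 'tf1', 'tf2', ..., 'tf11'
vocabulary = sorted(set().union(*result.values()))    # every distinct term across the documents

# rows correspond to terms, columns to documents; .get(term, 0) yields 0 when a term is absent
td_matrix = [[result[d].get(term, 0) for d in doc_names] for term in vocabulary]

# look up the row for a term, e.g. 'aaa' from the sample data (only if it actually occurs)
row = td_matrix[vocabulary.index('aaa')] if 'aaa' in vocabulary else None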