I have two files. The first is a .txt file containing a dictionary like this:
Box OB
Table OB
Tiger AN
Lion AN
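The second is a .txt file with a long text, like this:
In a box. In that box there are a lion and a tiger.
I want to count how many times each word from my dictionary occurs in the text.
Something like this: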
Box: 2
Lion: 1
Tiger: 1
This is what I have done so far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
file = codecs.open("MYtext.txt",'r','utf-8')
text = file.readlines()
line_list = []
for line in text:
    line.rstrip('\n')
    line_list.append(line)
d = {}
import nltk
with open("MYdict.txt",) as mydict:
    for line in mydict:
        (key, val) = line.split()
        dictionary = dict(line.strip().split(None, 1) for line in mydict)
line_counter = 0
for line in line_list:
    line_counter = line_counter + 1
    for word in line.split():
        if word in line_list in dictionary.keys():
            line_list = dictionary[word]
            line_list.append(line_counter)
            dictionary[word] = line_list
        else:
            line_list = []
            line_list.append(line_counter)
            dictionary[word] = line_list
for key in sorted(dictionary.keys()):
    print key, len(dictionary[key])
I get this error:
$ /var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup\ At\ Startup/test\ 44-405955317.432.py.command ; exit;
Traceback (most recent call last):
File "/private/var/folders/3h/w3_12zfs7hs6zcrlnpk8gdg40000gn/T/Cleanup At Startup/test 44-405955317.367.py", line 33, in <module>
for key in sorted(dictionary.keys()):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
logout
[Process completed]
Can you help? I am a beginner. I am a linguist, not a programmer.
Answer 0 (score: 0)
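Your code has several problems that are unrelated to the error you are getting.
You should group your imports at the top of the file. The import nltk line should not sit in the middle of the code (and nltk is never used in what you posted).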
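First you process the dictionary. There you have an outer loop (for line in mydict) and, inside it, a second loop over the same file (a generator expression passed to dict). That is not good: both loops consume the same file iterator, so the line taken by the outer loop never makes it into the dictionary. You can simply use: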
with open("MYdict.txt",) as mydict:
    dictionary = dict(line.strip().split(None, 1) for line in mydict)
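To see why the nested version loses an entry, here is a small sketch that uses an in-memory file (io.StringIO, standing in for MYdict.txt) filled with the sample dictionary from the question. It is written for Python 3, and the names SAMPLE, nested and simple are only for illustration.

```python
import io

SAMPLE = u"Box OB\nTable OB\nTiger AN\nLion AN\n"

# Nested version from the question: the outer loop takes the first line,
# then the generator expression consumes all remaining lines.
mydict = io.StringIO(SAMPLE)
for line in mydict:
    (key, val) = line.split()
    nested = dict(l.strip().split(None, 1) for l in mydict)

# Simple version: a single pass over the file, nothing is skipped.
mydict = io.StringIO(SAMPLE)
simple = dict(l.strip().split(None, 1) for l in mydict)

print(sorted(nested))  # 'Box' is missing here
print(sorted(simple))
```

The outer loop runs only once, because the inner expression has already read the file to the end.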
It would be better, though, to store the keys in lowercase:
with open("MYdict.txt",) as mydict:
    dictionary = {x[0].lower(): x[1] for x in [line.strip().split(None, 1) for line in mydict]}
To read the text, strip the line endings, and store the lines, you can use the string method splitlines, like this:
with codecs.open("MYtext.txt",'r','utf-8') as mytext:
    line_list = mytext.read().splitlines()
However, it is better to process the file line by line than to keep all the lines in memory.
There is no need for a for loop to count the lines. If you only want the total, just use len(line_list).
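If what you want is the number of the current line while looping (which is what line_counter does in the question), the built-in enumerate provides it without a manual counter. A minimal sketch for Python 3, with two made-up lines in place of the real file:

```python
line_list = ["In a box.", "In that box there are a lion and a tiger."]

# Total number of lines: no loop needed.
total = len(line_list)

# Line numbers while looping: enumerate does the counting, starting at 1 here.
numbered = [(number, line) for number, line in enumerate(line_list, 1)]

print(total)
for number, line in numbered:
    print(number, line)
```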
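I do not really understand what you are doing in the last part of your code. You seem to be mixing up some earlier variables (such as line from the previous loop), and you overwrite the line_list variable while you are still iterating over it.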
Here is one way to do it:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs

# Read the dictionary as UTF-8 as well, and store the keys in lowercase.
with codecs.open("MYdict.txt", 'r', 'utf-8') as mydict:
    dictionary = {x[0].lower(): x[1] for x in [line.strip().split(None, 1) for line in mydict]}

word_count = {}
with codecs.open("MYtext.txt", 'r', 'utf-8') as mytext:
    for line in mytext:
        for word in line.split():
            # Drop surrounding punctuation and lowercase to match the keys.
            word = word.strip('.,').lower()
            if word in dictionary:
                word_count[word] = word_count.get(word, 0) + 1

# Print the words from most to least frequent.
for key in sorted(word_count, key=word_count.get, reverse=True):
    print("%s : %i" % (key, word_count[key]))
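You can of course merge the two for loops into one by iterating over a generator expression that yields the words of every line in turn: for word in (w for line in mytext for w in line.split())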
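As a sketch of the single-loop idea, the generator below flattens the lines into one stream of words, and collections.Counter from the standard library does the counting. The in-memory text and the contents of dictionary are based on the question's sample; the code is written for Python 3, and the name words is only for illustration.

```python
import io
from collections import Counter

dictionary = {"box": "OB", "table": "OB", "tiger": "AN", "lion": "AN"}
mytext = io.StringIO(u"In a box. In that box there are a lion and a tiger.\n")

# One flat stream of cleaned-up words instead of two nested loops.
words = (w.strip('.,').lower() for line in mytext for w in line.split())

# Count only the words that appear in the dictionary.
word_count = Counter(w for w in words if w in dictionary)

for key, count in word_count.most_common():
    print("%s : %i" % (key, count))
```

Counter.most_common() returns the entries sorted by descending count, so no separate sorted call is needed.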
Answer 1 (score: 0)
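The error you get is probably related to the encoding of "MYdict.txt". I think you can solve it by opening that file with codecs.open and the 'utf-8' encoding, just as you do for the other file.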
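Some background on the error message: 0xc3 is the lead byte of many two-byte UTF-8 sequences (accented Latin letters, for example), while the ASCII codec only accepts bytes below 0x80. The sketch below reproduces the same kind of failure in isolation and shows that decoding as UTF-8 works. The word "Löwe" is only an assumed example of a non-ASCII dictionary entry, since the real file contents are not shown in the question.

```python
# UTF-8 bytes as they would sit in the file (assumed example entry).
raw = b"L\xc3\xb6we AN"

# Decoding as ASCII fails on the first byte that is >= 0x80.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the file's real encoding works.
decoded = raw.decode("utf-8")
key = decoded.split()[0]

print(error)
print(key)
```

In Python 2 the ASCII decode happens implicitly whenever a byte string with such bytes is combined or compared with a unicode string, which is why reading both files with the same explicit encoding avoids it.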
If I understand what you want to do, I would solve it like this:
import codecs

with codecs.open('MYdict.txt', 'r', 'utf-8') as f:
    wordslist = [line.split()[0].lower() for line in f]

with codecs.open('MYtext.txt', 'r', 'utf-8') as f:
    text = f.read().lower()

counts = {}
for word in wordslist:
    counts[word] = text.count(word)

# alternatively instead of the last 3 lines
# you can use a "dictionary comprehension"
counts = {word: text.count(word) for word in wordslist}
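One caveat: the count method of strings counts substrings, so "lion" is also counted inside "lions". If whole words are wanted, one possible variant is to split the text into word tokens with a regular expression first. A sketch for Python 3, with the sample data written inline instead of read from the files (the sentence is altered to contain "lions" to show the difference):

```python
import re
from collections import Counter

wordslist = ["box", "table", "tiger", "lion"]
text = u"In a box. In that box there are lions and a tiger.".lower()

# Substring counting: 'lion' matches inside 'lions'.
substring_counts = {word: text.count(word) for word in wordslist}

# Whole-word counting: split into word tokens, then count exact matches.
tokens = Counter(re.findall(r"\w+", text))
word_counts = {word: tokens[word] for word in wordslist}

print(substring_counts)
print(word_counts)
```

Which behaviour is right depends on whether inflected forms such as plurals should count as hits.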
To pretty-print the output you can use:
import pprint
pprint.pprint(counts)