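I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).
In the file, the keywords I am looking for are in upper case, for example HALLUCINATION. Each keyword is followed by a few lines dedicated to pronunciation, which I do not need.
What I want to extract is the definition, which is marked by "Defn:", and then print its lines. I have come up with this rather ugly solution: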
    def lookup(search):
        find = search.upper()                   # transforms our search parameter to all upper-case letters
        output = []                             # empty dummy list
        infile = open('webster.txt', 'r')       # opening the webster file for reading
        for line in infile:
            for part in line.split():
                if (find == part):
                    for line in infile:
                        if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                            output.append(line[6:])
                            print(output)              # uncertain about how to proceed
                            break
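Of course, this only prints the first line of text that follows "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.
So I suppose I want to tell Python to keep reading until it reaches a line that holds nothing but newline characters, without including that last line in the result.
Could someone point me to useful functions I could use to solve this problem? A minimal example would be appreciated.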
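Example of the desired output:
lookup("hallucinate")
out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.
lookup("hallucination")
out: The perception of objects which have no reality, or of\r\n sensations which have no corresponding external cause, arising from\r\n disorder or the nervous system, as in delirium tremens; delusion.\r\n Hallucinations are always evidence of cerebral derangement and are\r\n common phenomena of insanity. W. A. Hammond.
From the text: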
HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]
Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]
1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.
2. (Med.)
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
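Answer 0 (score: 0)
From here I learned an easy way to handle memory-mapped files and use them as if they were strings. You can then use something like this to get the first definition of a term.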
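    import mmap

    def lookup(search):
        # mmap searches bytes, so encode the upper-cased term
        term = search.upper().encode('ascii')
        with open('webster.txt', 'rb') as f:
            s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # a headword sits on a line of its own, right after a blank line
            index = s.find(b'\r\n\r\n' + term + b'\r\n')
            if index == -1:
                return None
            # skip past 'Defn:' and the space that follows it
            definition = s.find(b'Defn:', index) + len(b'Defn:') + 1
            # the definition ends at the next blank line
            endline = s.find(b'\r\n\r\n', definition)
            return s[definition:endline].decode('utf-8', 'replace')

    print(lookup('hallucination'))
    print(lookup('hallucinate'))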
Assumptions:
Answer 1 (score: 0)
Here is a function that returns the first definition:
    def lookup(word):
        word_upper = word.upper()
        found_word = False
        found_def = False
        defn = ''
        with open('dict.txt', 'r') as file:
            for line in file:
                l = line.strip()
                if not found_word and l == word_upper:
                    found_word = True
                elif found_word and not found_def and l.startswith("Defn:"):
                    found_def = True
                    defn = l[6:]
                elif found_def and l != '':
                    defn += ' ' + l
                elif found_def and l == '':
                    return defn
        return False

    print(lookup('hallucination'))
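Explanation: there are four different cases we need to consider. First, we have not found the word yet: we compare the current line with the word in upper case, and if they are equal we have found the word. Second, we have found the word but not the start of the definition: we look for a line that starts with Defn:. If we find it, we add the line to the definition (excluding the six characters of "Defn: "). Third, we have found the start of the definition and the line is not empty: we append the line to the definition. Fourth, we have found the start of the definition and the line is empty: the definition is complete and we return it. If we find nothing, we return False.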
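Note: some entries, such as CRANE, have more than one definition. The code above does not handle that; it returns only the first definition. Given the format of the file, however, writing a perfect solution is not easy.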
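One possible way to collect every "Defn:" paragraph of an entry is sketched below. It is only a sketch under stated assumptions: lookup_all and HEADWORD are hypothetical names, and the rule that an entry ends at the next all-upper-case line is a guess about the file format, not something documented. Numbered senses that carry no "Defn:" label (such as sense 1 of HALLUCINATION in the excerpt above) are not picked up.

```python
import re

# Assumption: a headword line is written entirely in upper case (letters,
# digits, spaces and a little punctuation). This pattern is a guess about
# the file format, not a documented rule.
HEADWORD = re.compile(r"^[A-Z][A-Z0-9 ;'.\-]*$")


def lookup_all(lines, word):
    """Return every 'Defn:' paragraph found under the first entry for word."""
    target = word.upper()
    definitions = []   # finished definitions, one string each
    current = None     # lines of the definition being read, or None
    in_entry = False

    for raw in lines:
        line = raw.strip()
        if not in_entry:
            # still scanning for the headword itself
            in_entry = (line == target)
            continue
        if HEADWORD.match(line):
            # the next headword closes the entry
            break
        if line.startswith("Defn:"):
            if current is not None:
                definitions.append(" ".join(current))
            current = [line[len("Defn:"):].strip()]
        elif line == "":
            # a blank line ends the definition paragraph, if one is open
            if current is not None:
                definitions.append(" ".join(current))
                current = None
        elif current is not None:
            # continuation line of the current definition
            current.append(line)

    if current is not None:
        definitions.append(" ".join(current))
    return definitions
```

The function takes any iterable of lines, so an open file object can be passed directly, for example lookup_all(open('dict.txt'), 'crane'). It returns an empty list when the word is not found.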
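Answer 2 (score: 0)
You can find the index of the search word, split the text from there into paragraphs, and return the first paragraph that starts with Defn: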
    import re

    def find_def(f, word):
        # newline='' keeps the file's \r\n line endings intact
        with open(f, newline='') as f:
            lines = f.read()
        try:
            start = lines.index("{}\r\n".format(word))  # find where our search word is
        except ValueError:
            return "Cannot find search term"
        # split into paragraphs; maxsplit=10 because no definition spans more paragraphs than that
        paras = re.split(r"\s+\r\n", lines[start:], maxsplit=10)
        for para in paras:
            if para.startswith("Defn:"):  # this paragraph is the definition we need
                return para

    print(find_def("in.txt", "HALLUCINATION"))
Run against the whole file, this returns:
In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.
In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
A slightly shorter version:
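    import re

    def find_def(f, word):
        # newline='' keeps the file's \r\n line endings intact
        with open(f, newline='') as f:
            lines = f.read()
        try:
            start = lines.index("{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        # offset of the first "Defn:" after the search word
        defn = lines[start:].index("Defn:")
        # cut at the first blank line that follows the definition
        return re.split(r"\s+\r\n", lines[start + defn:], maxsplit=1)[0]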
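The paragraph returned by find_def still carries the "Defn:" label and the file's \r\n line breaks, whereas the question asks for plain running text. A small clean-up step could look like the sketch below; tidy is a hypothetical helper name, not part of the answer above.

```python
def tidy(paragraph):
    """Drop a leading 'Defn:' label and collapse all whitespace runs."""
    text = paragraph.strip()
    if text.startswith("Defn:"):
        text = text[len("Defn:"):]
    # str.split() with no argument splits on any whitespace, including \r\n
    return " ".join(text.split())


raw = ("Defn: To wander; to go astray; to err; to blunder; -- used of mental\r\n"
       "processes. [R.] Byron.")
print(tidy(raw))
# To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.
```

Since find_def returns an error string or None when nothing is found, the result is best checked before it is passed to tidy.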