Question

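I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).

In the file, the keywords I am looking for are in uppercase, for instance HALLUCINATION. They are followed by some lines dedicated to pronunciation, which I do not need.

What I want to extract is the definition, which is marked by "Defn:", and then print its lines. I came up with this rather ugly "solution":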

def lookup(search):
    find = search.upper()                   # transforms our search parameter all upper letters
    output = []                             # empty dummy list
    infile = open('webster.txt', 'r')       # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print output               # uncertain about how to proceed
                        break

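Now this of course only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.

So I would like to tell Python to keep reading until the lines of the definition run out (at the next empty line, I suppose), without including that last line.

Could someone point me to useful functions I might use to solve this problem? A minimal example would be appreciated.

Example of desired output: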

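lookup("hallucinate")

out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.

lookup("hallucination")

out: The perception of objects which have no reality, or of\r\nsensations which have no corresponding external cause, arising from\r\ndisorder or the nervous system, as in delirium tremens; delusion.\r\nHallucinations are always evidence of cerebral derangement and are\r\ncommon phenomena of insanity. W. A. Hammond.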

From the text:

HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]

1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.

2. (Med.)

Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]

Answer 1

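From here I learned an easy way to work with memory-mapped files and use them as if they were strings. You can then use something like this to get the first definition of a term.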

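import mmap

def lookup(search):
    term = search.upper()
    f = open('webster.txt')
    # Map the whole file read-only so it can be searched like a string
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A headword sits on its own line, preceded by a blank line
    index = s.find('\r\n\r\n' + term + '\r\n')
    if index == -1:
        return None
    # Skip past 'Defn:' and the space that follows it
    definition = s.find('Defn:', index) + len('Defn:') + 1
    endline = s.find('\r\n\r\n', definition)
    return s[definition:endline]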

print lookup('hallucination')
print lookup('hallucinate')

Assumptions:

Every term has at least one definition.
If there is more than one, only the first is returned.
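
The function above is written for Python 2. On Python 3, `mmap.find` expects bytes rather than `str`, so a variant along the following lines may be needed. This is only a sketch: the name `lookup_mmap`, the `path` parameter and the `latin-1` decoding are assumptions made for illustration, not part of the original answer.

```python
import mmap

def lookup_mmap(search, path='webster.txt'):
    """Return the first 'Defn:' paragraph after the headword, or None."""
    term = search.upper().encode('ascii')
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            # A headword sits on its own line, preceded by a blank line.
            index = s.find(b'\r\n\r\n' + term + b'\r\n')
            if index == -1:
                return None
            marker = s.find(b'Defn:', index)
            if marker == -1:
                return None
            start = marker + len(b'Defn: ')
            # The definition ends at the next blank line (or at end of file).
            end = s.find(b'\r\n\r\n', start)
            if end == -1:
                end = len(s)
            # Assumption: latin-1 never fails to decode; adjust to the
            # file's real encoding if it is known.
            return s[start:end].decode('latin-1')
```

The same assumptions apply: if a term has no definition of its own, the search runs on into the next entry.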

Answer 2

Here is a function that returns the first definition:

def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    return False

print lookup('hallucination')

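Explanation: there are four different cases to consider.

We have not found the word yet. We compare the current line with the word we are looking for, in uppercase. If they are equal, we have found the word.
We have found the word, but not yet the start of the definition. So we look for a line that starts with Defn:. When we find it, we add the line to the definition (excluding the six characters of "Defn: ", that is, the marker and the space after it).
We have already found the start of the definition. In this case we just append the line to the definition.
We have already found the start of the definition and the current line is empty. The definition is complete and we return it.

If we find nothing, we return False.

Note: some entries, for example CRANE, have more than one definition. The code above cannot handle that; it simply returns the first definition. However, given the format of the file, writing a perfect solution is far from easy.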
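
One possible way to collect every definition of an entry is sketched below. It rests on an assumption that is not stated in the answer: a headword line is taken to be any line that follows a blank line, is entirely uppercase and contains at least one letter. The names `is_headword` and `lookup_all` and the encoding arguments are hypothetical, and the sketch targets Python 3.

```python
def is_headword(line):
    # Assumption: headwords are all-uppercase lines containing a letter.
    return line == line.upper() and any(ch.isalpha() for ch in line)

def lookup_all(word, path='dict.txt'):
    """Return a list of all 'Defn:' paragraphs found under the headword."""
    target = word.upper()
    definitions = []
    current = None       # lines of the definition being collected, or None
    in_entry = False     # True while inside an entry for the target word
    prev_blank = True
    # Assumption: undecodable bytes are replaced rather than raising.
    with open(path, 'r', encoding='utf-8', errors='replace') as handle:
        for raw in handle:
            line = raw.strip()
            if line == '':
                # A blank line closes the definition in progress.
                if current is not None:
                    definitions.append(' '.join(current))
                    current = None
            elif current is not None:
                current.append(line)
            elif prev_blank and is_headword(line):
                in_entry = (line == target)
            elif in_entry and line.startswith('Defn:'):
                current = [line[len('Defn:'):].strip()]
            prev_blank = (line == '')
    if current is not None:
        definitions.append(' '.join(current))
    return definitions
```

An empty list means that either the word or its definitions were not found. Senses that are given as numbered paragraphs without a Defn: marker are still skipped.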

Answer 3

You can split the text into paragraphs, use the index of the search word, and return the first Defn paragraph that follows it:

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read() 
        try:
            start = lines.index("{}\r\n".format(word)) # find where our search word is
        except ValueError: 
            return "Cannot find search term" 
        paras = re.split(r"\s+\r\n", lines[start:], 10)  # split into paragraphs; maxsplit=10 because no definition groups more paragraphs than that
        for para in paras:
            if para.startswith("Defn:"): # if para startswith Defn: we have what we need
                return para # return the  para

print(find_def("in.txt","HALLUCINATION"))

Run against the whole file, this returns:

In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.

In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read()
        try:
            start = lines.index("{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        defn = lines[start:].index("Defn:")
        return re.split(r"\s+\r\n", lines[start+defn:], 1)[0]
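
Both versions rely on the text still containing "\r\n" after it is read. On Python 3, a file opened in text mode has its line endings translated to "\n" by default, so those searches would find nothing. A sketch that does not depend on the line-ending style might look like this; the name `find_def_portable` and the encoding arguments are assumptions made for illustration.

```python
import re

def find_def_portable(path, word):
    """Return the first 'Defn:' paragraph after the headword line, or None."""
    # Text mode translates '\r\n' to '\n' on read (Python 3 default).
    with open(path, encoding='utf-8', errors='replace') as handle:
        text = handle.read()
    # Match the headword only when it fills a whole line.
    headword = re.search(r'^{}$'.format(re.escape(word)), text, re.MULTILINE)
    if headword is None:
        return None
    start = re.compile(r'^Defn:', re.MULTILINE).search(text, headword.end())
    if start is None:
        return None
    # The paragraph ends at the next blank line, or at end of file.
    end = re.compile(r'\n\s*\n').search(text, start.start())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()
```

Matching the headword against a whole line also avoids hitting a longer word that merely ends with the search term.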

Creating a dictionary using Python and a .txt file
