Creating a dictionary using Python and a .txt file

Time: 2014-10-20 17:06:55

Tags: python python-2.7

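I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).

In the file, the keywords I am looking for are in uppercase, for example HALLUCINATION. Each keyword is followed by some lines dedicated to pronunciation, which are obsolete for my purposes.

What I want to extract is the definition, which is marked by "Defn:", and then print those lines. I came up with this rather ugly solution: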

def lookup(search):
    find = search.upper()                   # transforms our search parameter all upper letters
    output = []                             # empty dummy list
    infile = open('webster.txt', 'r')       # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print output               # uncertain about how to proceed
                        break

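Now this of course only prints the first line that starts with "Defn:". I am new to manipulating .txt files in Python, so I am clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.

So I want to somehow tell Python to keep reading until it hits a line that contains nothing but the newline characters, I suppose, without including that last line in the result.

Could someone point me to useful functions I could use to solve this problem? A minimal example would be appreciated.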


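Example of desired output

lookup("hallucinate")

out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.

lookup("hallucination")

out: The perception of objects which have no reality, or of\r\nsensations which have no corresponding external cause, arising from\r\ndisorder or the nervous system, as in delirium tremens; delusion.\r\nHallucinations are always evidence of cerebral derangement and are\r\ncommon phenomena of insanity. W. A. Hammond.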


From the text:

HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]

1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.

2. (Med.)

Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]

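3 Answers:

Answer 0 (score: 0)

Here I learned an easy way to deal with memory-mapped files and use them as if they were strings. You can then use something like this to get the first definition of a term.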

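import mmap

def lookup(search):
    term = search.upper()
    f = open('webster.txt')
    # map the whole file read-only so it can be searched like a string
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # a headword sits on its own line, preceded by a blank line
    index = s.find('\r\n\r\n' + term + '\r\n')
    if index == -1:
        return None
    # skip past "Defn:" and the space that follows it
    definition = s.find('Defn:', index) + len('Defn:') + 1
    # the definition ends at the next blank line
    endline = s.find('\r\n\r\n', definition)
    return s[definition:endline]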

print lookup('hallucination')
print lookup('hallucinate')

Assumptions:

  • Every term has at least one definition.
  • If there is more than one, only the first is returned.

Answer 1 (score: 0)

Here is a function that returns the first definition:

def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    return False

print lookup('hallucination')

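Explanation: we have to consider four different cases.

  • We have not found the word yet. We compare the current line with the word we are looking for, in uppercase. If they are equal, we have found the word.
  • We have found the word but not yet the start of the definition. We therefore look for a line that starts with Defn:. When we find it, we add that line to the definition (excluding the six characters of "Defn: ").
  • We have already found the start of the definition and the current line is not empty. In this case, we simply append the line to the definition.
  • We have already found the start of the definition and the current line is empty. The definition is complete, so we return it.

If we find nothing, we return False.

Note: some entries, such as CRANE, have multiple definitions. The code above does not handle that; it returns only the first definition. However, given the format of the file, writing a perfect solution is far from easy.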
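
To illustrate one way of collecting every definition instead of only the first, here is a minimal sketch rather than a complete solution. It assumes that a headword line is written entirely in uppercase and that an entry ends where a different headword begins; the real file may well contain lines that break this assumption. The names `is_headword` and `lookup_all` are made up for this example, and `SAMPLE` is just the excerpt from the question.

```python
def is_headword(line):
    # Assumption: headwords are all uppercase and contain at least one letter.
    return line == line.upper() and any(ch.isalpha() for ch in line)


def lookup_all(lines, word):
    """Return a list with every 'Defn:' paragraph under the entry for `word`."""
    target = word.upper()
    definitions = []
    current = None      # parts of the definition being collected, or None
    in_entry = False    # True while we are inside an entry for `target`
    for raw in lines:
        line = raw.strip()
        starts_new = is_headword(line)
        # A blank line, a headword or another "Defn:" closes the open definition.
        if current is not None and (
                line == "" or starts_new or line.startswith("Defn:")):
            definitions.append(" ".join(current).strip())
            current = None
        if starts_new:
            if in_entry and line != target:
                break                       # a different entry begins
            in_entry = (line == target)
        elif in_entry and line.startswith("Defn:"):
            current = [line[len("Defn:"):].strip()]
        elif current is not None:
            current.append(line)            # continuation line of a definition
    if current is not None:                 # file ended inside a definition
        definitions.append(" ".join(current).strip())
    return definitions


SAMPLE = '''HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]

1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.

2. (Med.)

Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
'''

for definition in lookup_all(SAMPLE.splitlines(), "hallucination"):
    print(definition)
```

With the real file, the open file object could be passed in place of `SAMPLE.splitlines()`, since the function only iterates over lines. Numbered senses that carry no "Defn:" marker, such as sense 1 of HALLUCINATION, are still skipped by this sketch.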

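Answer 2 (score: 0)

You can split the text into paragraphs, use the index of the search word as the starting point, and find the first Defn paragraph after it: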

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read() 
        try:
            start = lines.index("{}\r\n".format(word)) # find where our search word is
        except ValueError: 
            return "Cannot find search term" 
        paras = re.split(r"\s+\r\n", lines[start:], 10) # split into paragraphs using maxsplit = 10 as there are no grouping of paras longer in the definitions
        for para in paras:
            if para.startswith("Defn:"): # if para startswith Defn: we have what we need
                return para # return the  para

print(find_def("in.txt","HALLUCINATION"))

Using the whole file, this returns:

In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.

In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

def find_def(f,word):
    import re
    with open(f) as f:
        lines = f.read()
        try:
            start = lines.index("{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        defn = lines[start:].index("Defn:")
        return re.split(r"\s+\r\n", lines[start + defn:], 1)[0]
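
If many lookups are needed, re-reading the file on every call gets expensive. Building on the same paragraph-splitting idea, the rough sketch below reads the text once and builds an actual dict that maps each headword to a list of its "Defn:" paragraphs. It assumes that a headword is the all-uppercase first line of a paragraph; `is_headword` and `build_dictionary` are names invented for this example, and the sample string is a shortened excerpt from the question.

```python
import re
from collections import defaultdict


def is_headword(line):
    # Assumption: headwords are all uppercase and contain at least one letter.
    return line == line.upper() and any(ch.isalpha() for ch in line)


def build_dictionary(text):
    """Map each headword to the list of 'Defn:' paragraphs that follow it."""
    entries = defaultdict(list)
    headword = None
    # A paragraph break is a newline, optional whitespace and another newline.
    # \r?\n accepts both \r\n and \n line endings.
    for para in re.split(r"\r?\n\s*\r?\n", text):
        para = para.strip()
        if not para:
            continue
        first_line = para.splitlines()[0].strip()
        if is_headword(first_line):
            headword = first_line
        elif headword is not None and para.startswith("Defn:"):
            # Collapse the line breaks inside the paragraph into single spaces.
            entries[headword].append(" ".join(para[len("Defn:"):].split()))
    return dict(entries)


sample = (
    "HALLUCINATE\r\n"
    "Hal*lu\"ci*nate, v. i.\r\n"
    "\r\n"
    "Defn: To wander; to go astray; to err; to blunder; -- used of mental\r\n"
    "processes. [R.] Byron.\r\n"
    "\r\n"
    "HALLUCINATOR\r\n"
    "Hal*lu\"ci*na`tor, n. Etym: [L.]\r\n"
)

webster = build_dictionary(sample)
print(webster.get("HALLUCINATE"))
```

Headwords without any "Defn:" paragraph, such as HALLUCINATOR in the sample, simply do not appear in the result. With the real file, `text` would be the full file contents; uppercase lines in the Project Gutenberg header may be mistaken for headwords, which is harmless as long as no "Defn:" paragraph follows them.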