How to write a Python program that reads from a text file and builds a dictionary mapping each word to the words that follow it

Time: 2014-01-10 00:10:53

Tags: python list python-2.7 dictionary

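I am having trouble writing a Python program that reads from a text file and builds a dictionary mapping each word that appears in the file to a list of all the words that immediately follow that word in the file. The list of words can be in any order and should include duplicates.

For example, the key "and" might map to the list ["then", "best", "after", ...], which holds all the words that come right after "and" in the text.

Any ideas would be a great help.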

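3 answers:

Answer 0: (score: 1)

Some ideas:

  1. Use a collections.defaultdict for your output. This is a dictionary that supplies a default value for keys that do not exist yet (in this case, as aelfric5578 suggested, an empty list);
  2. Build a list of all the words in the file, in order; and
  3. Use zip(lst, lst[1:]) to create pairs of consecutive list elements.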
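
The three ideas above can be combined in a minimal sketch. The function name build_followers and the whitespace-only tokenization are assumptions made for illustration; they are not part of the answer:

```python
from collections import defaultdict


def build_followers(path):
    """Map each word in the file at `path` to the list of words that follow it.

    Hypothetical helper: the name and the naive whitespace split are
    illustrative assumptions.
    """
    # Idea 2: read the whole file and split on any run of whitespace.
    with open(path) as handle:
        words = handle.read().split()

    # Idea 1: a key that is not present yet starts out as an empty list.
    followers = defaultdict(list)

    # Idea 3: zip(words, words[1:]) yields each pair of consecutive words.
    for current, following in zip(words, words[1:]):
        followers[current].append(following)

    return followers
```

The last word of the file only becomes a key if it also occurs earlier, because nothing follows its final occurrence.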

Answer 1: (score: 0)

I would do it like this:

from collections import defaultdict

# My example line :
s = 'In the face of ambiguity refuse the temptation to guess'

# The string above is easy to tokenize, but on real-world text you will have to:
# - remove commas, periods and other punctuation
# - probably convert to ASCII (the third-party unidecode module can help)
# - probably normalize case as well

lst = s.lower().split()  # naive tokenizer: splits on any run of whitespace

ddic = defaultdict(list)

for word1, word2 in zip(lst, lst[1:]):
    ddic[word1].append(word2)

# ddic now contains what you want (but it is a defaultdict)
# If you want to work with a plain dictionary, just convert it
# (often this is not needed):
dic = dict(ddic)

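Sorry if it looks like I stole the commenters' ideas, but this is almost the same as code I use in some projects (a similar precomputation step for document algorithms).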
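
The comments in the code above mention that real text needs punctuation removal and case normalization. One possible way to do that, as a sketch only, is to tokenize with a regular expression. The names tokenize and followers, and the choice of which characters count as part of a word, are assumptions for illustration:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters, digits and apostrophes.

    Assumption: anything else (punctuation, whitespace, non-ASCII letters)
    is treated as a separator and dropped.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def followers(text):
    """Build the word -> following-words mapping from a string."""
    words = tokenize(text)
    result = defaultdict(list)
    for word1, word2 in zip(words, words[1:]):
        result[word1].append(word2)
    return dict(result)
```

With this tokenizer, "The", "the" and "the," all count as the same key. Text with accented characters would need a different pattern or a transliteration step first.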

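Answer 2: (score: 0)

Welcome to stackoverflow.com

Are you sure you need a dictionary? If the text is long, a dictionary will take up a lot of memory just to store the same data several times over for a handful of entries. If you use a function instead, it can give you the lists you need on demand. For example: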

s = """In Newtonian physics, free fall is any motion
of a body where its weight is the only force acting
upon it. In the context of general relativity where
gravitation is reduced to a space-time curvature,
a body in free fall has no force acting on it and
it moves along a geodesic. The present article
concerns itself with free fall in the Newtonian domain."""

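import re

def say_me(word, li=re.split(r'\s+', s)):
    # li defaults to the words of s, split on runs of whitespace
    for i, w in enumerate(li):
        if w == word:
            # print every word that comes after this occurrence
            print('\n%s at index %d followed by\n%s' % (w, i, li[i+1:]))

say_me('free')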

Result

free at index 3 followed by
['fall', 'is', 'any', 'motion', 'of', 'a', 'body', 'where', 'its', 'weight', 'is', 'the', 'only', 'force', 'acting', 'upon', 'it.', 'In', 'the', 'context', 'of', 'general', 'relativity', 'where', 'gravitation', 'is', 'reduced', 'to', 'a', 'space-time', 'curvature,', 'a', 'body', 'in', 'free', 'fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']

free at index 38 followed by
['fall', 'has', 'no', 'force', 'acting', 'on', 'it', 'and', 'it', 'moves', 'along', 'a', 'geodesic.', 'The', 'present', 'article', 'concerns', 'itself', 'with', 'free', 'fall', 'in', 'the', 'Newtonian', 'domain.']

free at index 58 followed by
['fall', 'in', 'the', 'Newtonian', 'domain.']

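The assignment to li in the function signature binds the parameter li to the list returned by the re.split call, which serves as its default argument.
This binding is performed only once: at the moment the interpreter executes the function definition to create the function object. From then on, li behaves like any parameter defined with a default value, so calling say_me with a single argument reuses that same list.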
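
The point that the default is evaluated only once can be checked with a small sketch. The names calls, make_default and show are hypothetical and exist only for this illustration:

```python
calls = []


def make_default():
    # Record every time the default expression is evaluated.
    calls.append("evaluated")
    return ["a", "b"]


def show(items=make_default()):
    # make_default() ran when this def statement was executed,
    # not when show() is called.
    return items


first = show()
second = show()
```

Here calls ends up with a single entry even though show is called twice, and both calls return the same list object. One consequence for say_me is that rebinding s after the function has been defined does not change the word list it searches by default.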