Making a Python dictionary from a huge text file?

Time: 2016-02-20 14:09:28

Tags: python

I have a tab-separated text file like this:

20001   World Economies
20002   Bill Clinton
20004   Internet Law
20005   Philipines Elections
20006   Israel Politics
20008   Golf
20009   Music
20010   Disasters

This is a huge file made up of 100 pairs like these. How can I create a dictionary in Python from this file?

def get_pair(line):
  key, sep, value = line.strip().partition("\t")
  return int(key), value


with open("TopicMapped.txt") as fd:    
           d = dict(get_pair(line) for line in fd)

fd=open('dictionary.txt', 'w')
print>> fd,d  

However, printing this dictionary to a file gives me an empty file?

2 answers:

Answer 0 (score: 3)

You can do this easily with the following simple code:

fID=open('TopicMapped.txt')

myDict=dict() #init empty dictionary

for line in fID:
    #read the file line-by-line (if it's huge, it might be cumbersome to import it entirely in memory, e.g. using readlines())
    # and also remove newline tags
    line=line.rstrip()

    #create a list where the first element is the number and the second element is the text
    line=line.split("\t")

    #update dictionary
    myDict[line[0]]=line[1]

print myDict
fID.close()

This code produces the following dictionary:

{'20010': 'Disasters', '20006': 'Israel Politics', '20005': 'Philipines Elections', '20004': 'Internet Law', '20002': 'Bill Clinton', '20001': 'World Economies', '20009': 'Music', '20008': 'Golf'}

If you want the numbers to be integers rather than strings, you can do something like this:

myDict[int(line[0])]=line[1] #update dictionary

The resulting dictionary will be:

{20001: 'World Economies', 20002: 'Bill Clinton', 20004: 'Internet Law', 20005: 'Philipines Elections', 20006: 'Israel Politics', 20008: 'Golf', 20009: 'Music', 20010: 'Disasters'}
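
The same loop can be wrapped in a small function that opens the file in a with block, so it is closed automatically. This is only a sketch in Python 3 syntax (the code above is Python 2); the function name load_topics and the temporary sample file are assumptions made for illustration and are not part of the original answer.

```python
import os
import tempfile


def load_topics(path):
    """Build {int id: topic} from tab-separated lines (hypothetical helper)."""
    topics = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, sep, value = line.partition("\t")
            if not sep:
                continue  # skip lines that contain no tab separator
            topics[int(key)] = value.strip()
    return topics


# Demo: a small temporary file stands in for TopicMapped.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("20001\tWorld Economies\n20002\tBill Clinton\n\n20008\tGolf\n")
    sample_path = tmp.name

topics = load_topics(sample_path)
os.remove(sample_path)
print(topics)
```

Skipping lines without a tab is a design choice for this sketch; raising an error instead may be preferable if malformed lines should not pass silently.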

Answer 1 (score: 3)

Your own code actually works. It only looks like it gives you an empty file because you are checking the file before closing it:

In [15]: fd=open('dictionary.txt', 'w')

In [16]: print >> fd, d
# looks empty
In [17]: cat dictionary.txt 
# actually close the file so what is in the buffer is written to disk
In [18]: fd.close()
# now you see the data
In [19]: cat dictionary.txt
{20001: '  World Economies', 20002: '  Bill Clinton', 20004: '  Internet Law', 20005: '  Philipines Elections', 20006: '  Israel Politics', 20008: '  Golf', 20009: '  Music', 20010: '  Disasters'}
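
The session above can be reproduced as a script. This is a rough sketch in Python 3 syntax, where print(d, file=fd) replaces print >> fd, d; the temporary directory and the two-entry dictionary are assumptions for the demo. With default buffering, a small write normally stays in memory until the file is flushed or closed.

```python
import os
import tempfile

d = {20001: "World Economies", 20002: "Bill Clinton"}

# Write into a throwaway directory instead of the current one.
path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")

fd = open(path, "w")
print(d, file=fd)  # Python 3 spelling of `print >> fd, d`
size_before_close = os.path.getsize(path)  # usually 0: data is still buffered
fd.close()  # closing flushes the buffer to disk
size_after_close = os.path.getsize(path)

print(size_before_close, size_after_close)
```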

You can do this with a dict comprehension, and open your files using with, which closes them automatically and avoids simple mistakes like the one in the code above:

In [7]: with open("text.txt") as f:
            dct = {int(k): v.rstrip() for line in f for k, v  in (line.split(None, 1),)}
   ...:     

In [8]: dct
Out[8]: 
{20001: 'World Economies',
 20002: 'Bill Clinton',
 20004: 'Internet Law',
 20005: 'Philipines Elections',
 20006: 'Israel Politics',
 20008: 'Golf',
 20009: 'Music',
 20010: 'Disasters'}
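
To see what line.split(None, 1) does in the comprehension above, here is a small sketch in Python 3 syntax. The sample lines are made up for illustration. Splitting on None treats any run of whitespace (spaces or tabs) as one separator, and the limit of 1 keeps multi-word values in one piece.

```python
# Hypothetical sample lines: one separated by spaces, one by a tab plus spaces.
lines = ["20001   World Economies\n", "20002\t  Bill Clinton\n"]

dct = {}
for line in lines:
    # At most one split, on any run of whitespace.
    key, value = line.split(None, 1)
    dct[int(key)] = value.rstrip()  # drop the trailing newline

# For comparison: partition("\t") keeps whatever follows the first tab,
# including any leading spaces.
_, _, tab_value = lines[1].strip().partition("\t")

print(dct)
print(repr(tab_value))
```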

If you want to store it in a file, use the json module:

In [13]: import json

In [14]: with open("text.txt") as f, open("out.json","w") as out:
            json.dump({int(k): v.rstrip() for line in f for k, v  in (line.split(None, 1),)}, out)
   ....:     

In [15]: cat out.json
{"20001": "World Economies", "20002": "Bill Clinton", "20004": "Internet Law", "20005": "Philipines Elections", "20006": "Israel Politics", "20008": "Golf", "20009": "Music", "20010": "Disasters"}

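JSON object keys are always strings, so json writes the integer keys out as strings. If you really want integers, you can pickle your dictionary instead: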

In [8]: import pickle

In [9]: with open("text.txt") as f, open("out.pkl","wb") as out:
            pickle.dump({int(k): v.rstrip() for line in f for k, v  in (line.split(None, 1),)}, out)
   ...:     

In [10]: with open("out.pkl","rb") as in_fle:
            dct = pickle.load(in_fle)
   ....:     

In [11]: dct
Out[11]: 
{20001: 'World Economies',
 20002: 'Bill Clinton',
 20004: 'Internet Law',
 20005: 'Philipines Elections',
 20006: 'Israel Politics',
 20008: 'Golf',
 20009: 'Music',
 20010: 'Disasters'}
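
If you would rather stay with JSON, the integer keys can be rebuilt after loading. This is a minimal sketch, assuming every key is a numeric string; the variable names and the two-entry dictionary are made up for the example.

```python
import json

dct = {20001: "World Economies", 20008: "Golf"}

encoded = json.dumps(dct)      # integer keys are written as strings
decoded = json.loads(encoded)  # keys come back as str, not int

# Convert the keys back to integers after loading.
restored = {int(key): value for key, value in decoded.items()}

print(decoded)
print(restored)
```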

You can also use the csv lib to do the parsing:

import csv
with open("text.txt") as f:
        dct = {int(k): v for k,v in csv.reader(f, delimiter="\t")}
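
Note that the comprehension above fails with a ValueError if any row does not have exactly two fields, for example a blank line. Here is a hedged sketch in Python 3 syntax that skips such rows; io.StringIO stands in for the real file, and the sample data is invented for the demo.

```python
import csv
import io

# In-memory stand-in for text.txt, including one blank line.
sample = io.StringIO("20001\tWorld Economies\n\n20008\tGolf\n", newline="")

dct = {}
for row in csv.reader(sample, delimiter="\t"):
    if len(row) != 2:
        continue  # skip blank or malformed rows
    key, value = row
    dct[int(key)] = value

print(dct)
```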