Question

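I have a set of text files, one per year, starting in 1880 and running through 2014. Each file contains a list of names, each with a gender and the number of occurrences of that name in that year, like this: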

Mary,F,14406
Anna,F,5773
Helen,F,5230

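Now I want to write a function that reads all the files and returns a nested dictionary of the form:

Name -> Year -> Count

That is, the name is the key, and its value is another dictionary in which every year is a key and the number of occurrences of the name in that year is the value.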
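To make the target shape concrete, here is a minimal sketch of the structure. The counts below are made up for illustration and are not taken from the real files.

```python
from collections import defaultdict

# Illustrative values only; these numbers are invented, not read from the files.
counts = defaultdict(dict)
counts["Mary"][1880] = 100
counts["Mary"][1881] = 90
counts["Anna"][1880] = 50

# Outer key: name. Inner key: year. Inner value: occurrences in that year.
print(counts["Mary"][1881])
print(sorted(counts["Mary"]))
```

With `defaultdict(dict)`, the inner dictionary for a name is created the first time that name is used as a key.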

This is what I came up with:

import glob
import os
from collections import defaultdict

def files_to_dict(folder_name):
    dic = defaultdict(dict)
    for filename in glob.glob(os.path.join(folder_name, '*.txt')):
        with open(filename, 'r') as yearFile:
            # Each file is named yob[year].txt, e.g. yob2011.txt, hence
            # I'm slicing the filename to get just the year.
            # (The slice assumes a path like 'names/yob2011.txt'.)
            year = int(filename[9:13])
            for line in yearFile:
                # [0] = Name, [1] = Gender, [2] = Total Occurrences of Name
                list_of_line = line.replace(',', ' ').split()
                dic[list_of_line[0]][year] = int(list_of_line[2])
    return dic

Now if I do this:

d = files_to_dict( 'names' )
print ( d['Mary'] )

I get something like this (showing only the last ten years or so):

...2004: 31, 2005: 10, 2006: 10, 2007: 10, 2008: 3490, 2009: 3154, 2010: 
   2862, 2011: 2701, 2012: 6, 2013: 2632, 2014: 5}

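Here, 2008 through 2011 all have the correct counts as they appear in the files, but 2004 through 2007, as well as 2012 and 2014, have incorrect counts. The whole output looks like this: most of the counts are wrong, and only a dozen or so are correct. I really don't understand why this happens. Does anyone have any idea?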
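One thing that may be worth checking, assuming a single file can list the same name once per gender (for example a `Mary,F,...` line and a `Mary,M,...` line): because the dictionary is keyed by name only, a later line for the same name would overwrite the earlier count for that year. The sketch below reproduces that effect with invented lines and a hypothetical alternative key of `(name, gender)`; it is not a confirmed diagnosis of the real data.

```python
from collections import defaultdict

# Hypothetical lines for a single year; the counts are invented.
lines = ["Mary,F,3000", "Anna,F,500", "Mary,M,10"]
year = 2005

by_name = defaultdict(dict)         # keyed by name only, as in files_to_dict
by_name_gender = defaultdict(dict)  # keyed by (name, gender), an assumed alternative

for line in lines:
    name, gender, count = line.strip().split(",")
    # The second "Mary" line overwrites the first under the name-only key.
    by_name[name][year] = int(count)
    by_name_gender[(name, gender)][year] = int(count)

print(by_name["Mary"][year])                # 10
print(by_name_gender[("Mary", "F")][year])  # 3000
```

If this is what happens, the small values in the output would be the counts from the second occurrence of the name in each file.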

Nested dictionary behaving in a strange way

0 answers: