Creating a deeply nested dictionary from a csv file using Python

Time: 2019-04-01 14:44:52

Tags: python csv dictionary tree

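I am building a multi-level nested dictionary by reading a large csv file. The file's contents are in the following format, where each row stores the information for one unique book. We can assume that every row has 6 columns (author, title, year, category, url, citations) and that all column entries have the same format. For example: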

Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6

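I would like the output to reflect how each row of the csv file is parsed, similar to the structure shown below. (Note: the number of nested dictionaries depends on the book's categories under the Category header of the csv. The keys are the consecutive categories, where order matters, separated by the ':' delimiter. Think of the ordered categories in each row as a directory path: several books can share the same path up to some point, or share the entire path and end up in the same folder.)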

results = {'1973':{
    "magic": {
        "fantasy": {
            "English literature": {
                "name": "goblet of fire",
                "citations": 6,
                "url": "http://doi.acm.org/10.1145/800010.808066"
            }
        },
        "medieval": {
            "name": "The Hobbit",
            "citations": 7,
            "url": "http://doi.acm.org/10.1145/800fdfdffd010.808066"
        }
       }
    },
    '1953':{
    "la": {
        "assessment": {
            "other": {
                "name": "cracking the coding interview",
                "citations": 6,
                "url": "http://doi.acm.org/10.1145/800010.808105"
            }
        }
    }
}
}

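Clearly, some books will share a common prefix of consecutive categories, as in the example above. Some books may even share exactly the same sequence of categories. I think I should recursively walk the category string of each csv row, either following an existing sub-dictionary or creating a new one where the row diverges from the categories already present, and then create the book's dictionary representation once there are no more categories to check. I'm just not sure how to start.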
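The recursive idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `insert` and the leaf field names are hypothetical, and the year is simply treated as the first element of the path. Note that in this form a second book with an identical path would overwrite the fields of the first.

```python
def insert(node, path, book):
    """Recursively descend `path`, creating sub-dictionaries as needed."""
    if not path:
        # No more categories to check: store the book's fields at this node.
        node.update(book)
        return
    head, *rest = path
    # Reuse the existing sub-dictionary for `head`, or create an empty one.
    insert(node.setdefault(head, {}), rest, book)


tree = {}
insert(tree, ["1973", "magic", "fantasy", "english literature"],
       {"name": "goblet of fire", "citations": 6})
insert(tree, ["1973", "LA", "assessment", "other"],
       {"name": "cracking the coding interview", "citations": 2})
insert(tree, ["1953", "magic", "fantasy", "medieval"],
       {"name": "hobbit", "citations": 6})
```

Both 1973 books end up under the same `"1973"` dictionary because `setdefault` returns the existing sub-dictionary instead of replacing it.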

Here is what I have so far, which is just the standard setup for reading the csv file:

with open(DATA_FILE, 'r') as data_file:
    data = csv.reader(data_file)

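Essentially, I want to build a tree representation of the csv using nested dictionaries, where the relative category path (i.e. magic: fantasy: etc...) determines which subtree to traverse or create. If two or more books share the same category path, I want each of those books to be a leaf under its own key, rather than overwriting the previous book (leaf) whenever a new book has the same category path. A leaf is the dictionary form of the book described by one row of the csv.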
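One possible reading of this requirement, using only the `csv` module, is to key each leaf by the book's title so that books sharing a category path sit side by side. The sketch below is an assumption-laden illustration, not a definitive solution: `build_tree` and `SAMPLE` are hypothetical names, the sample rows are the three rows from the question, and two books with the same title in the same path (or a title equal to a sub-category name) would still collide.

```python
import csv
import io

# The three sample rows from the question, inlined so the sketch is self-contained.
SAMPLE = """Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
"""


def build_tree(lines):
    """Build {year: {category: {...: {title: book}}}} from csv lines."""
    tree = {}
    for row in csv.DictReader(lines):
        # The year is the top level, then one level per category in the path.
        node = tree.setdefault(row["Year"], {})
        for key in row["Category"].split(":"):
            node = node.setdefault(key.strip(), {})
        # Key the leaf by title so books with the same path do not overwrite each other.
        node[row["Title"]] = {
            "author": row["Author"],
            "citations": int(row["Citations"]),
            "url": row["Url"],
        }
    return tree


tree = build_tree(io.StringIO(SAMPLE))
```

For a real file, the same function could be given an open file object, for example `with open(DATA_FILE, newline='') as f: tree = build_tree(f)`.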

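3 answers:

Answer 0: (score: 1)

You can group the data by category (using plain dictionaries, since you mentioned that you cannot use any module other than csv) and then apply recursion: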

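import csv
import json  # only used to pretty-print the result

# Skip the header row; the remaining rows are the books.
_, *data = csv.reader(open('filename.csv'))
# Reorder each row to [category path, url, citations, author, title, year].
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
def group(d):
  # Bucket the rows by the first category remaining in their path.
  _d = {}
  for a, *b in d:
    if a[0] not in _d:
      _d[a[0]] = [[a[1:], *b]]
    else:
      _d[a[0]].append([a[1:], *b])
  # Rows whose path is exhausted become 'books'; the rest are grouped recursively.
  r = {a:{'books':[{'name':c[-2], 'citations':c[2], 'url':c[1], 'author':c[3]} for c in b if not c[0]], **(lambda x:{} if not x else group(x))([c for c in b if c[0]])} for a, b in _d.items()}
  # Drop empty 'books' lists so that only populated keys remain.
  return {a:{c:d for c, d in b.items() if d} for a, b in r.items()}

print(json.dumps(group(new_data), indent=4))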

Output:

{
    "magic": {
        "fantasy": {
            "english literature": {
                "books": [
                    {
                        "name": "goblet of fire",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "jk rowling, etc...."
                    }
                ]
            },
            "medieval": {
                "books": [
                    {
                        "name": "hobbit",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "Tolkien"
                    }
                ]
            }
        }
    },
    "LA": {
        "assessment": {
            "other": {
                "books": [
                    {
                        "name": "cracking the coding interview",
                        "citations": "2",
                        "url": "http://doi.acm.org/10.1145/800010.808105",
                        "author": "Weiner, Leonard H."
                    }
                ]
            }
        }
    }
}

Edit: to group by publication year as well:

import csv
_, *data = csv.reader(open('filename.csv'))
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
_data = {}
for i in new_data:
  if i[-1] not in _data:
    _data[i[-1]] = [i]
  else:
    _data[i[-1]].append(i)

final_result = {a:group(b) for a, b in _data.items()}

Output:

{
    "1973": {
        "magic": {
            "fantasy": {
                "english literature": {
                    "books": [
                        {
                            "name": "goblet of fire",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "jk rowling, etc...."
                        }
                    ]
                }
            }
        },
        "LA": {
            "assessment": {
                "other": {
                    "books": [
                        {
                            "name": "cracking the coding interview",
                            "citations": "2",
                            "url": "http://doi.acm.org/10.1145/800010.808105",
                            "author": "Weiner, Leonard H."
                        }
                    ]
                }
            }
        }
    },
    "1953": {
        "magic": {
            "fantasy": {
                "medieval": {
                    "books": [
                        {
                            "name": "hobbit",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "Tolkien"
                        }
                    ]
                }
            }
        }
    }
}
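
Once the data has this year / category / `books` shape, reading it back is a matter of walking the same path. The helper below is a hypothetical sketch (the name `find_books` and its `sep` parameter are not part of the answer above), shown against a trimmed copy of the output:

```python
def find_books(tree, year, category_path, sep=": "):
    """Return the books stored under year/category_path, or [] if the path is absent."""
    node = tree.get(year, {})
    for key in category_path.split(sep):
        node = node.get(key)
        if not isinstance(node, dict):
            # Missing key, or the path ran into a 'books' list instead of a category.
            return []
    return node.get("books", [])


# A trimmed copy of the structure printed above, for illustration only.
final_result = {
    "1973": {
        "magic": {"fantasy": {"english literature": {"books": [
            {"name": "goblet of fire", "citations": "6"}]}}},
    },
    "1953": {
        "magic": {"fantasy": {"medieval": {"books": [
            {"name": "hobbit", "citations": "6"}]}}},
    },
}

hobbit_shelf = find_books(final_result, "1953", "magic: fantasy: medieval")
missing = find_books(final_result, "1953", "LA: assessment: other")
```

Returning an empty list for an unknown path keeps the caller free of `KeyError` handling; raising instead would be an equally reasonable design.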

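Answer 1: (score: 0)

  1. Nest each category under its parent category
  2. Parse the CSV into a pandas DataFrame
  3. Group the rows by category
  4. Use to_dict() inside the groupby loop to convert each group to a dict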
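
The steps above could be sketched roughly like this. It is one interpretation of the outline rather than the answerer's own code: it assumes pandas is installed, inlines the question's three sample rows as a hypothetical `SAMPLE` string, and stores each group under an assumed `"books"` key.

```python
import io

import pandas as pd

# The three sample rows from the question, inlined so the sketch is self-contained.
SAMPLE = """Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
"""

# Step 2: parse the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(SAMPLE))

tree = {}
# Step 3: one group per distinct category string.
for category, rows in df.groupby("Category"):
    # Step 1: nest each category under its parent, creating levels as needed.
    node = tree
    for key in category.split(": "):
        node = node.setdefault(key, {})
    # Step 4: convert the group's rows to a list of plain dicts.
    node["books"] = rows.drop(columns="Category").to_dict(orient="records")
```

Grouping on the full category string means every book with an identical path lands in the same `"books"` list, so nothing is overwritten.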

Answer 2: (score: 0)

You can do something like the following:

import pandas as pd
df = pd.read_csv('yourcsv.csv', sep=',')

Next, you want to isolate the Category column and split its contents on the ': ' separator:

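cols_no_categ = list(df.columns)
cols_no_categ.remove('Category')
category = df['Category']
DICT = {}
for c in category:
    # .loc selects rows by boolean mask and columns by label in one step
    dicto = df.loc[df.Category == c, cols_no_categ].to_dict()
    *parents, leaf = c.split(': ')
    # create the intermediate dictionaries on the way down to the leaf
    node = DICT
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = dicto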