Question

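I am building a multi-level nested dictionary by reading a large CSV file. The file uses the format below, where each row stores the information for one unique book. We can assume every row has six columns (author, title, year, category, URL, citations) and that all column entries are formatted consistently. For example: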

Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6

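I would like the output to reflect how each row of the CSV is parsed, something like the following. (Note: the number of nested dictionaries depends on the book's categories under the Category header. The keys come from the consecutive categories (order matters), which are separated by the ':' delimiter. Think of each row's category sequence as a directory path: several files can share the same path up to a certain point, or share the entire path and sit in the same folder.)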

results = {'1973':{
    "magic": {
        "fantasy": {
            "English literature": {
                "name": "goblet of fire",
                "citations": 6,
                "url": "http://doi.acm.org/10.1145/800010.808066"
            }
        },
        "medieval": {
            "name": "The Hobbit",
            "citations": 7,
            "url": "http://doi.acm.org/10.1145/800fdfdffd010.808066"
        }
       }
    },
    '1953':{
    "la": {
        "assessment": {
            "other": {
                "name": "cracking the coding interview",
                "citations": 6,
                "url": "http://doi.acm.org/10.1145/800010.808105"
            }
        }
    }
}
}

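Clearly some books will share a common run of consecutive categories, as in the example above, and some may share exactly the same category sequence. I think I should walk recursively through each row's category string, creating a new sub-dictionary wherever it diverges from an existing category path, and then create the book's dictionary representation once there are no more categories to check. I am just not sure how to start.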

Here is what I have so far, which is just the standard setup for reading a CSV file:

with open(DATA_FILE, 'r') as data_file:
    data = csv.reader(data_file)

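Essentially, I want a tree representation of the CSV built from nested dictionaries, where the category path (i.e. magic: fantasy: etc...) determines which subtree to traverse or create. If two or more books have the same path, I want each of them to be a leaf under its own key, rather than having each new book overwrite the previous leaf whenever it has the same category path. A leaf is the dictionary form of the book described by one CSV row.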

Answer 1

You can group the data by category (using plain dictionaries, since you mentioned you cannot use any module other than csv) and then apply recursion:

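import csv
# Skip the header row; keep the remaining rows as data.
_, *data = csv.reader(open('filename.csv'))
# Reorder each row to [category path, url, citations, author, title, year].
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
def group(d):
  # Bucket the rows by the first component of their category path.
  _d = {}
  for a, *b in d:
    if a[0] not in _d:
      _d[a[0]] = [[a[1:], *b]]
    else:
      _d[a[0]].append([a[1:], *b])
  # Rows whose path is now empty become 'books' entries; the rest recurse one level deeper.
  r = {a:{'books':[{'name':c[-2], 'citations':c[2], 'url':c[1], 'author':c[3]} for c in b if not c[0]], **(lambda x:{} if not x else group(x))([c for c in b if c[0]])} for a, b in _d.items()}
  # Drop empty 'books' lists and empty sub-dictionaries.
  return {a:{c:d for c, d in b.items() if d} for a, b in r.items()}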

import json
print(json.dumps(group(new_data), indent=4))

Output:

{
    "magic": {
        "fantasy": {
            "english literature": {
                "books": [
                    {
                        "name": "goblet of fire",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "jk rowling, etc...."
                    }
                ]
            },
            "medieval": {
                "books": [
                    {
                        "name": "hobbit",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "Tolkien"
                    }
                ]
            }
        }
    },
    "LA": {
        "assessment": {
            "other": {
                "books": [
                    {
                        "name": "cracking the coding interview",
                        "citations": "2",
                        "url": "http://doi.acm.org/10.1145/800010.808105",
                        "author": "Weiner, Leonard H."
                    }
                ]
            }
        }
    }
}

Edit: to group by publication year as well:

import csv
_, *data = csv.reader(open('filename.csv'))
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
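# Bucket the reordered rows by year, which is the last element of each row.
_data = {}
for i in new_data:
  if i[-1] not in _data:
    _data[i[-1]] = [i]
  else:
    _data[i[-1]].append(i)

# Build one category tree per year, reusing group() from above.
final_result = {a:group(b) for a, b in _data.items()}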

Output:

{
    "1973": {
        "magic": {
            "fantasy": {
                "english literature": {
                    "books": [
                        {
                            "name": "goblet of fire",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "jk rowling, etc...."
                        }
                    ]
                }
            }
        },
        "LA": {
            "assessment": {
                "other": {
                    "books": [
                        {
                            "name": "cracking the coding interview",
                            "citations": "2",
                            "url": "http://doi.acm.org/10.1145/800010.808105",
                            "author": "Weiner, Leonard H."
                        }
                    ]
                }
            }
        }
    },
    "1953": {
        "magic": {
            "fantasy": {
                "medieval": {
                    "books": [
                        {
                            "name": "hobbit",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "Tolkien"
                        }
                    ]
                }
            }
        }
    }
}
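
For comparison, the same year/category tree can probably also be built in a single pass without recursion, by walking each row's category path with dict.setdefault. This is only a sketch under a few assumptions: build_tree and SAMPLE are hypothetical names, the sample rows are copied from the question so the snippet is self-contained, and a category literally named "books" would collide with the leaf key.

```python
import csv
import io

# Sample rows copied from the question; a real script would open the CSV file instead.
SAMPLE = '''Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
'''

def build_tree(lines):
    """Hypothetical helper: build {year: {category: {...: {'books': [...]}}}} in one pass."""
    tree = {}
    for row in csv.DictReader(lines):
        # Start at the year, then descend one level per category component.
        node = tree.setdefault(row['Year'], {})
        for part in row['Category'].split(':'):
            node = node.setdefault(part.strip(), {})
        # Append rather than assign, so books sharing a path do not overwrite each other.
        node.setdefault('books', []).append({
            'name': row['Title'],
            'citations': row['Citations'],
            'url': row['Url'],
            'author': row['Author'],
        })
    return tree

result = build_tree(io.StringIO(SAMPLE))
```

Because csv.DictReader accepts any iterable of lines, passing an open file object instead of io.StringIO(SAMPLE) should work the same way. Values stay as strings, as in the output above.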

Answer 2

To nest by category:
Parse the CSV into a pandas DataFrame
Group by the Category column
Use to_dict() inside the groupby loop to convert each group to a dict
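
The steps above might look roughly like this. It is a sketch rather than a definitive implementation: it assumes pandas is installed, reads the question's sample rows from an in-memory string (SAMPLE is a hypothetical name), and stores each group's rows under a "books" key, which is an illustrative choice.

```python
import io
import pandas as pd

# Sample rows copied from the question; a real script would pass a file path to read_csv.
SAMPLE = '''Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
'''

df = pd.read_csv(io.StringIO(SAMPLE))

nested = {}
# One group per distinct Category string, e.g. "magic: fantasy: medieval".
for category, group in df.groupby('Category'):
    # Walk (or create) one dictionary level per component of the category path.
    node = nested
    for part in category.split(':'):
        node = node.setdefault(part.strip(), {})
    # One dict per book in this category; the Category column itself is dropped.
    node['books'] = group.drop(columns='Category').to_dict(orient='records')
```

Grouping first means books with an identical category path end up in the same list instead of overwriting one another.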

Answer 3

You can do something like the following:

import pandas as pd
df = pd.read_csv('yourcsv.csv', sep=',')

Next, isolate the Category column and split its contents on the colon:

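cols_no_categ = list(df.columns)
cols_no_categ.remove('Category')
category = df['Category']
DICT = {}
for c in category:
    # .loc is needed to select rows by boolean mask and columns by label together
    dicto = df.loc[df.Category == c, cols_no_categ].to_dict()
    s = c.split(': ')
    # create the intermediate levels first so the assignment cannot raise KeyError
    node = DICT
    for key in s[:-1]:
        node = node.setdefault(key, {})
    node[s[-1]] = dicto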

Creating a deeply nested dictionary from a CSV file with Python