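I am building a multi-level nested dictionary by reading a large CSV file. The file stores information about one unique book per row, in the format shown below. We can assume every row has 6 columns (author, title, year, category, url, citations) and that all column entries share the same format. For example: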
Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
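I would like the output to reflect how each row of the CSV file is parsed, something like the following. (Note: the depth of nesting depends on the book's categories under the Category header. The keys are the successive categories (order matters), separated by the ':' delimiter. Think of each row's ordered categories as a directory path: several books can share the same path up to a certain point, or share the entire path and sit in the same folder.)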
results = {
    '1973': {
        "magic": {
            "fantasy": {
                "English literature": {
                    "name": "goblet of fire",
                    "citations": 6,
                    "url": "http://doi.acm.org/10.1145/800010.808066"
                }
            },
            "medieval": {
                "name": "The Hobbit",
                "citations": 7,
                "url": "http://doi.acm.org/10.1145/800fdfdffd010.808066"
            }
        }
    },
    '1953': {
        "la": {
            "assessment": {
                "other": {
                    "name": "cracking the coding interview",
                    "citations": 6,
                    "url": "http://doi.acm.org/10.1145/800010.808105"
                }
            }
        }
    }
}
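Clearly, some books share a common run of successive categories, as in the example above, and some may share exactly the same full sequence. I think I should walk recursively through each row's category string, creating a new sub-dictionary wherever it diverges from the existing category sequence, and then create the book's dictionary representation once there are no more categories to check. I'm just not sure how to start.
Here is what I have so far, which is just the standard setup for reading a CSV file: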
import csv

with open(DATA_FILE, newline='') as data_file:
    data = csv.reader(data_file)
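Essentially, I want to build a tree representation of the CSV using nested dictionaries, where the relative category path (i.e. magic:fantasy:etc ...) determines which subtree to traverse or create. If two or more books have the same path, I want each of those books to be a leaf under its own key, rather than having each new book with the same category path overwrite the previous one. A leaf is the dictionary form of the book described by one row of the CSV.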
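Answer 0 (score: 1)
You can group the data by category (using plain dictionaries, since you mentioned that you cannot use any module other than csv) and then apply recursion: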
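import csv
import json

# Skip the header row; the remaining rows are the books.
_, *data = csv.reader(open('filename.csv'))
# Reorder each row to [category path, url, citations, author, title, year].
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]

def group(d):
    # Bucket the rows by their first remaining category and drop it from the path.
    _d = {}
    for a, *b in d:
        if a[0] not in _d:
            _d[a[0]] = [[a[1:], *b]]
        else:
            _d[a[0]].append([a[1:], *b])
    # Rows whose path is exhausted become leaves under 'books'; the rest recurse.
    r = {a: {'books': [{'name': c[-2], 'citations': c[2], 'url': c[1], 'author': c[3]} for c in b if not c[0]],
             **(lambda x: {} if not x else group(x))([c for c in b if c[0]])}
         for a, b in _d.items()}
    # Drop empty 'books' lists from the result.
    return {a: {c: d for c, d in b.items() if d} for a, b in r.items()}

print(json.dumps(group(new_data), indent=4))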
Output:
{
    "magic": {
        "fantasy": {
            "english literature": {
                "books": [
                    {
                        "name": "goblet of fire",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "jk rowling, etc...."
                    }
                ]
            },
            "medieval": {
                "books": [
                    {
                        "name": "hobbit",
                        "citations": "6",
                        "url": "http://doi.acm.org/10.1145/800010.808066",
                        "author": "Tolkien"
                    }
                ]
            }
        }
    },
    "LA": {
        "assessment": {
            "other": {
                "books": [
                    {
                        "name": "cracking the coding interview",
                        "citations": "2",
                        "url": "http://doi.acm.org/10.1145/800010.808105",
                        "author": "Weiner, Leonard H."
                    }
                ]
            }
        }
    }
}
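For comparison, the same category tree can also be built without recursion by walking each row's path with dict.setdefault. This is only a minimal sketch, not the method used above: build_tree and SAMPLE are hypothetical names, the sample rows are copied from the question and read from memory instead of a file, and it assumes no category is itself named 'books'.

```python
import csv
import io

# Sample rows copied from the question; a real run would open the CSV file instead.
SAMPLE = '''Author,Title,Year,Category,Url,Citations
"jk rowling, etc....",goblet of fire,1973,magic: fantasy: english literature,http://doi.acm.org/10.1145/800010.808066,6
"Weiner, Leonard H.",cracking the coding interview,1973,LA: assessment: other,http://doi.acm.org/10.1145/800010.808105,2
"Tolkien",hobbit,1953,magic: fantasy: medieval,http://doi.acm.org/10.1145/800010.808066,6
'''

def build_tree(rows):
    """Hypothetical helper: nest each book under its category path."""
    tree = {}
    for row in rows:
        node = tree
        # Walk (or create) one dictionary level per category in the path.
        for key in row['Category'].split(': '):
            node = node.setdefault(key, {})
        # Append instead of assigning, so books sharing a path are all kept.
        node.setdefault('books', []).append({
            'name': row['Title'],
            'citations': row['Citations'],
            'url': row['Url'],
            'author': row['Author'],
        })
    return tree

tree = build_tree(csv.DictReader(io.StringIO(SAMPLE)))
```

Because each leaf is a list, a second book with the same category path is appended to 'books' rather than overwriting the first, which is the behaviour the question asks for.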
Edit: to group by publication year as well:
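import csv
import json

# Assumes group() from the snippet above is already defined.
_, *data = csv.reader(open('filename.csv'))
new_data = [[i[3].split(': '), *i[4:], *i[:3]] for i in data]
# Bucket the reordered rows by year, which is the last element of each row.
_data = {}
for i in new_data:
    if i[-1] not in _data:
        _data[i[-1]] = [i]
    else:
        _data[i[-1]].append(i)
final_result = {a: group(b) for a, b in _data.items()}
print(json.dumps(final_result, indent=4))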
Output:
{
    "1973": {
        "magic": {
            "fantasy": {
                "english literature": {
                    "books": [
                        {
                            "name": "goblet of fire",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "jk rowling, etc...."
                        }
                    ]
                }
            }
        },
        "LA": {
            "assessment": {
                "other": {
                    "books": [
                        {
                            "name": "cracking the coding interview",
                            "citations": "2",
                            "url": "http://doi.acm.org/10.1145/800010.808105",
                            "author": "Weiner, Leonard H."
                        }
                    ]
                }
            }
        }
    },
    "1953": {
        "magic": {
            "fantasy": {
                "medieval": {
                    "books": [
                        {
                            "name": "hobbit",
                            "citations": "6",
                            "url": "http://doi.acm.org/10.1145/800010.808066",
                            "author": "Tolkien"
                        }
                    ]
                }
            }
        }
    }
}
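Note that the citations appear as strings ("6") in the output above, because csv.reader returns every field as text, while the expected result in the question uses numbers. If numeric values are wanted, the leaf can be built with an explicit cast. A minimal sketch, assuming rows keep the column order Author, Title, Year, Category, Url, Citations; to_book is a hypothetical helper, not part of the code above:

```python
def to_book(row):
    """Hypothetical helper: build one leaf from a csv.reader row laid out as
    (Author, Title, Year, Category, Url, Citations)."""
    author, title, _year, _category, url, citations = row
    return {
        'name': title,
        'citations': int(citations),  # csv yields strings; cast to a number
        'url': url,
        'author': author,
    }

book = to_book(['Tolkien', 'hobbit', '1953', 'magic: fantasy: medieval',
                'http://doi.acm.org/10.1145/800010.808066', '6'])
```

int() raises ValueError on an empty or non-numeric field, so real data may need a fallback for missing citation counts.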
Answer 1 (score: 0)
Answer 2 (score: 0)
You can do the following:
import pandas as pd
df = pd.read_csv('yourcsv.csv', sep=',')
Next, isolate the Category column and split its contents on the colon:
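cols_no_categ = list(df.columns)
cols_no_categ.remove('Category')
category = df['Category']
DICT = {}
for c in category.unique():
    # .loc is needed to combine a row mask with a list of columns;
    # 'records' gives one dict per book in this category.
    dicto = df.loc[df.Category == c, cols_no_categ].to_dict('records')
    # Create the intermediate levels on demand to avoid a KeyError.
    node = DICT
    *parents, leaf = c.split(': ')
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = dicto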
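Once a nested dictionary of this shape exists, an entry can be fetched with the same ': '-separated path used in the CSV. A minimal sketch; lookup and the small example tree are hypothetical stand-ins, not output of the code above:

```python
def lookup(tree, path, sep=': '):
    """Hypothetical helper: follow an 'a: b: c' category path through nested dicts."""
    node = tree
    for key in path.split(sep):
        node = node[key]  # raises KeyError if the path does not exist
    return node

# Stand-in for a tree built from the sample data.
example = {'magic': {'fantasy': {'medieval': [{'Title': 'hobbit'}]}}}
found = lookup(example, 'magic: fantasy: medieval')
```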