How to define a list inside a nested dictionary built from a CSV

Time: 2017-08-24 17:00:26

Tags: python python-2.7 csv dictionary

I have a CSV shaped like this. Note that several records share the same timestamp, and within one timestamp several records share the same data4 value:

Time,data1,data2,data3,data4
8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64
8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60
8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64
8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64
8/12/2017 8:37:26.919,4436121.79704301,324496.562027231,1260.354,96
8/12/2017 8:37:26.919,4436121.79704301,324506.563540091,1260.354,96
8/12/2017 8:37:26.919,4436121.79704301,324546.569591528,1260.354,56
8/12/2017 8:37:26.919,4436121.79704301,324646.584720121,1260.354,64

I am trying to write a function that reads this CSV into a nested dictionary, using the Time and data4 columns as the nested keys. What I have so far:

import csv
from collections import defaultdict

def build_dict(source_file):
    new_dict = defaultdict(dict)

    headers = ['Time','data1','data2','data3','data4']
    with open(source_file, 'rb') as fp:
        reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
                                skipinitialspace=True)
        for rowdict in reader:
            if None in rowdict:
                del rowdict[None]
            Time = rowdict.pop("Time")
            data4 = int(rowdict.pop("data4"))
            # plain assignment: a later row with the same Time/data4 replaces the earlier one
            new_dict[Time][data4] = rowdict
    return dict(new_dict)

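It almost does what I need, but whenever Time and data4 are the same as in an earlier row, it overwrites that row's data. I think I need to store data1, data2 and data3 in a list, but I don't know how to do that.

This is what I would like my dictionary to look like, so that for each timestamp I can group the data by data4 value: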

new_dict = {
    '8/12/2017 8:37:11.719' : {
        64: {'data3': '1260.354', 'data1': '4435441.97983871', 'data2': '321106.049167927'},
        60: {'data3': '1260.354', 'data1': '4435451.97715054', 'data2': '321346.085476551'}
    }
}
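
The overwriting described above can be reproduced in isolation. The following is only a minimal sketch with made-up tuples instead of the real CSV, assuming Python 3; the names `rows`, `overwriting` and `appending` are illustrative and not part of the original code. It contrasts assigning a value to a key with appending to a list stored under that key:

```python
from collections import defaultdict

# Hypothetical stand-ins for (Time, data4, payload); not the real CSV rows.
rows = [("t0", 64, "a"), ("t0", 60, "b"), ("t0", 64, "c")]

overwriting = defaultdict(dict)                     # same shape as in the question
appending = defaultdict(lambda: defaultdict(list))  # a list under each data4 key

for time, data4, payload in rows:
    overwriting[time][data4] = payload       # replaces any earlier payload
    appending[time][data4].append(payload)   # keeps every payload

print(overwriting["t0"][64])
print(appending["t0"][64])
```

With assignment only the last payload for a key survives, while the list version keeps all of them in file order.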

2 answers:

Answer 0: (score: 0)

I suggest using the Pandas library, since a Pandas DataFrame gives you a convenient way to read a CSV file and group its rows.

import pandas as pd

# read the CSV file
df = pd.read_csv("test.csv")

# group by the desired columns
gb = df.groupby(['Time', 'data4'])

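This returns a GroupBy object whose keys are (timestamp, data4) tuples and whose value for each group is a new DataFrame containing the matching rows. Now you have three options: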

# option 1
list(gb)

This gives you:

[(('8/12/2017 8:37:11.719', 60),
                      Time         data1          data2     data3  data4
  1  8/12/2017 8:37:11.719  4.435452e+06  321346.085477  1260.354     60),
 (('8/12/2017 8:37:11.719', 64),
                      Time         data1          data2     data3  data4
  0  8/12/2017 8:37:11.719  4.435442e+06  321106.049168  1260.354     64
  2  8/12/2017 8:37:11.719  4.435462e+06  321096.047655  1260.354     64
  3  8/12/2017 8:37:11.719  4.435462e+06  321106.049168  1260.354     64),
 (('8/12/2017 8:37:26.919', 56),
                      Time         data1          data2     data3  data4
  6  8/12/2017 8:37:26.919  4.436122e+06  324546.569592  1260.354     56),
 (('8/12/2017 8:37:26.919', 64),
                      Time         data1         data2     data3  data4
  7  8/12/2017 8:37:26.919  4.436122e+06  324646.58472  1260.354     64),
 (('8/12/2017 8:37:26.919', 96),
                      Time         data1          data2     data3  data4
  4  8/12/2017 8:37:26.919  4.436122e+06  324496.562027  1260.354     96
  5  8/12/2017 8:37:26.919  4.436122e+06  324506.563540  1260.354     96)]

You can also turn it into a dictionary, which yields a comparable result:

# option 2
dict(list(gb))

Or you iterate over the groups and do whatever you need with the rows of each group:

# option 3
result = {}
for name, df_group in gb:
    # each group name is a (Time, data4) tuple
    timestamp, data4 = name
    outer_dict = result.get(timestamp, {})
    # one inner entry per row, keyed by the row's DataFrame index
    inner_dict = df_group.T.to_dict()
    #inner_dict = df_group.to_dict(orient="index")
    #inner_dict = df_group.values.tolist()

    outer_dict[data4] = inner_dict
    result[timestamp] = outer_dict

print(result)

This contains the following. You can drop some of the columns, such as the index, the timestamp and data4.

{'8/12/2017 8:37:11.719': {60: {1: {'Time': '8/12/2017 8:37:11.719',
    'data1': 4435451.97715054,
    'data2': 321346.08547655103,
    'data3': 1260.354,
    'data4': 60}},
  64: {0: {'Time': '8/12/2017 8:37:11.719',
    'data1': 4435441.97983871,
    'data2': 321106.049167927,
    'data3': 1260.354,
    'data4': 64},
   2: {'Time': '8/12/2017 8:37:11.719',
    'data1': 4435461.97446237,
    'data2': 321096.047655068,
    'data3': 1260.354,
    'data4': 64},
   3: {'Time': '8/12/2017 8:37:11.719',
    'data1': 4435461.97446237,
    'data2': 321106.049167927,
    'data3': 1260.354,
    'data4': 64}}},
 '8/12/2017 8:37:26.919': {56: {6: {'Time': '8/12/2017 8:37:26.919',
    'data1': 4436121.79704301,
    'data2': 324546.569591528,
    'data3': 1260.354,
    'data4': 56}},
  64: {7: {'Time': '8/12/2017 8:37:26.919',
    'data1': 4436121.79704301,
    'data2': 324646.584720121,
    'data3': 1260.354,
    'data4': 64}},
  96: {4: {'Time': '8/12/2017 8:37:26.919',
    'data1': 4436121.79704301,
    'data2': 324496.56202723103,
    'data3': 1260.354,
    'data4': 96},
   5: {'Time': '8/12/2017 8:37:26.919',
    'data1': 4436121.79704301,
    'data2': 324506.56354009104,
    'data3': 1260.354,
    'data4': 96}}}}

Hope you get the idea.

Answer 1: (score: 0)

Well, this is a classic use case: grouping.

So the simpler approach is to use itertools.groupby to group your dicts by "Time".

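import csv
import itertools
import operator

# fp is the open CSV file. DictReader takes the field names from the header
# row itself; an extra next(reader) call would discard the first data row.
reader = csv.DictReader(fp, dialect='excel', skipinitialspace=True)
new_dict = {}
for group, records in itertools.groupby(reader, key=operator.itemgetter('Time')):
    new_dict[group] = list(records)

You get:

{'8/12/2017 8:37:11.719': [{'Time': '8/12/2017 8:37:11.719',
                            'data1': '4435441.97983871',
                            'data2': '321106.049167927',
                            'data3': '1260.354',
                            'data4': '64'},
                           {'Time': '8/12/2017 8:37:11.719',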
                            'data1': '4435451.97715054',
                            'data2': '321346.085476551',
                            'data3': '1260.354',
                            'data4': '60'},
                           {'Time': '8/12/2017 8:37:11.719',
                            'data1': '4435461.97446237',
                            'data2': '321096.047655068',
                            'data3': '1260.354',
                            'data4': '64'},
                           {'Time': '8/12/2017 8:37:11.719',
                            'data1': '4435461.97446237',
                            'data2': '321106.049167927',
                            'data3': '1260.354',
                            'data4': '64'}],
 '8/12/2017 8:37:26.919': [{'Time': '8/12/2017 8:37:26.919',
                            'data1': '4436121.79704301',
                            'data2': '324496.562027231',
                            'data3': '1260.354',
                            'data4': '96'},
                           {'Time': '8/12/2017 8:37:26.919',
                            'data1': '4436121.79704301',
                            'data2': '324506.563540091',
                            'data3': '1260.354',
                            'data4': '96'},
                           {'Time': '8/12/2017 8:37:26.919',
                            'data1': '4436121.79704301',
                            'data2': '324546.569591528',
                            'data3': '1260.354',
                            'data4': '56'},
                           {'Time': '8/12/2017 8:37:26.919',
                            'data1': '4436121.79704301',
                            'data2': '324646.584720121',
                            'data3': '1260.354',
                            'data4': '64'}]}

You can also use a dict comprehension:

new_dict = {group: list(records)
            for group, records in itertools.groupby(reader, key=operator.itemgetter('Time'))}

If you need to group by both "Time" and "data4", change the grouping key:

for group, records in itertools.groupby(reader, key=lambda v: (v["Time"], int(v["data4"]))):
    new_dict[group] = list(records)

The result is:

{('8/12/2017 8:37:11.719', 60): [{'Time': '8/12/2017 8:37:11.719',
                                  'data1': '4435451.97715054',
                                  'data2': '321346.085476551',
                                  'data3': '1260.354',
                                  'data4': '60'}],
 ('8/12/2017 8:37:11.719', 64): [{'Time': '8/12/2017 8:37:11.719',
                                  'data1': '4435461.97446237',
                                  'data2': '321096.047655068',
                                  'data3': '1260.354',
                                  'data4': '64'},
                                 {'Time': '8/12/2017 8:37:11.719',
                                  'data1': '4435461.97446237',
                                  'data2': '321106.049167927',
                                  'data3': '1260.354',
                                  'data4': '64'}],
 ('8/12/2017 8:37:26.919', 56): [{'Time': '8/12/2017 8:37:26.919',
                                  'data1': '4436121.79704301',
                                  'data2': '324546.569591528',
                                  'data3': '1260.354',
                                  'data4': '56'}],
 ('8/12/2017 8:37:26.919', 64): [{'Time': '8/12/2017 8:37:26.919',
                                  'data1': '4436121.79704301',
                                  'data2': '324646.584720121',
                                  'data3': '1260.354',
                                  'data4': '64'}],
 ('8/12/2017 8:37:26.919', 96): [{'Time': '8/12/2017 8:37:26.919',
                                  'data1': '4436121.79704301',
                                  'data2': '324496.562027231',
                                  'data3': '1260.354',
                                  'data4': '96'},
                                 {'Time': '8/12/2017 8:37:26.919',
                                  'data1': '4436121.79704301',
                                  'data2': '324506.563540091',
                                  'data3': '1260.354',
                                  'data4': '96'}]}
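
One caveat is worth checking here: itertools.groupby only merges consecutive items that share a key. In the sample CSV the data4 values for the first timestamp arrive in the order 64, 60, 64, 64, so the key 64 forms two separate runs, and storing the groups in a dict lets the later run replace the earlier one. The sketch below uses three made-up rows to show the effect and one possible remedy (sorting by the same key first); `rows` and `group_key` are hypothetical names, and Python 3 is assumed:

```python
import itertools

# Hypothetical rows mimicking the 64, 60, 64 ordering of the sample CSV.
rows = [
    {"Time": "t0", "data4": "64", "data1": "a"},
    {"Time": "t0", "data4": "60", "data1": "b"},
    {"Time": "t0", "data4": "64", "data1": "c"},
]

def group_key(row):
    """Composite key: the timestamp plus data4 as an integer."""
    return (row["Time"], int(row["data4"]))

# Unsorted input: key ("t0", 64) appears in two runs, the second replaces the first.
unsorted_groups = {k: list(g) for k, g in itertools.groupby(rows, key=group_key)}

# Sorting by the same key first makes equal keys adjacent, so nothing is lost.
sorted_groups = {
    k: list(g)
    for k, g in itertools.groupby(sorted(rows, key=group_key), key=group_key)
}

print(len(unsorted_groups[("t0", 64)]), len(sorted_groups[("t0", 64)]))
```

Sorting needs all rows in memory at once, which is usually fine for files of this size.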

If you need two-level grouping, first by "Time" and then by "data4", you need two loops:

new_dict = {}
for group1, records1 in itertools.groupby(reader, key=operator.itemgetter("Time")):
    new_dict[group1] = {}
    for group2, records2 in itertools.groupby(records1, key=lambda v: int(v["data4"])):
        new_dict[group1][group2] = list(records2)

Result:

{'8/12/2017 8:37:11.719': {60: [{'Time': '8/12/2017 8:37:11.719',
                                 'data1': '4435451.97715054',
                                 'data2': '321346.085476551',
                                 'data3': '1260.354',
                                 'data4': '60'}],
                           64: [{'Time': '8/12/2017 8:37:11.719',
                                 'data1': '4435461.97446237',
                                 'data2': '321096.047655068',
                                 'data3': '1260.354',
                                 'data4': '64'},
                                {'Time': '8/12/2017 8:37:11.719',
                                 'data1': '4435461.97446237',
                                 'data2': '321106.049167927',
                                 'data3': '1260.354',
                                 'data4': '64'}]},
 '8/12/2017 8:37:26.919': {56: [{'Time': '8/12/2017 8:37:26.919',
                                 'data1': '4436121.79704301',
                                 'data2': '324546.569591528',
                                 'data3': '1260.354',
                                 'data4': '56'}],
                           64: [{'Time': '8/12/2017 8:37:26.919',
                                 'data1': '4436121.79704301',
                                 'data2': '324646.584720121',
                                 'data3': '1260.354',
                                 'data4': '64'}],
                           96: [{'Time': '8/12/2017 8:37:26.919',
                                 'data1': '4436121.79704301',
                                 'data2': '324496.562027231',
                                 'data3': '1260.354',
                                 'data4': '96'},
                                {'Time': '8/12/2017 8:37:26.919',
                                 'data1': '4436121.79704301',
                                 'data2': '324506.563540091',
                                 'data3': '1260.354',
                                 'data4': '96'}]}}
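
For the structure asked about in the question (a list of records under each data4 key, without repeating the key columns), a nested defaultdict is one possible alternative that does not depend on row order. This is only a sketch under stated assumptions: Python 3, a CSV whose first line is the header row, and the hypothetical names `build_nested` and `sample`, none of which come from the original posts.

```python
import csv
import io
from collections import defaultdict

def build_nested(fp):
    """Group CSV rows as {Time: {data4: [remaining columns, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    # DictReader reads the field names from the header row of the file.
    for row in csv.DictReader(fp, skipinitialspace=True):
        time = row.pop("Time")
        data4 = int(row.pop("data4"))
        grouped[time][data4].append(dict(row))  # append, never overwrite
    # Convert the defaultdicts to plain dicts before returning.
    return {time: dict(inner) for time, inner in grouped.items()}

# First four rows of the sample CSV, held in memory instead of a file.
sample = io.StringIO(
    "Time,data1,data2,data3,data4\n"
    "8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64\n"
    "8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60\n"
    "8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64\n"
    "8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64\n"
)
nested = build_nested(sample)
print(nested["8/12/2017 8:37:11.719"][60])
```

With a real file, the same function could be called as `build_nested(open(path, newline=""))`, where `path` is whatever file name applies. The values stay strings, as they do with csv.DictReader in the answers above.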