I have a CSV formed like this. Note that there are multiple records with the same timestamp, and within a timestamp several records share the same data4 value:
Time,data1,data2,data3,data4
8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64
8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60
8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64
8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64
8/12/2017 8:37:26.919,4436121.79704301,324496.562027231,1260.354,96
8/12/2017 8:37:26.919,4436121.79704301,324506.563540091,1260.354,96
8/12/2017 8:37:26.919,4436121.79704301,324546.569591528,1260.354,56
8/12/2017 8:37:26.919,4436121.79704301,324646.584720121,1260.354,64
I am trying to write a function that reads this CSV into a nested dictionary, using the Time column and the data4 column as the nested keys. What I have so far is:
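import csv
from collections import defaultdict

def build_dict(source_file):
    new_dict = defaultdict(dict)
    headers = ['Time', 'data1', 'data2', 'data3', 'data4']
    # Python 3: open CSV files in text mode with newline=''
    with open(source_file, newline='') as fp:
        reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
                                skipinitialspace=True)
        next(reader)  # skip the header row, since fieldnames are given explicitly
        for rowdict in reader:
            # drop any extra, unnamed columns
            if None in rowdict:
                del rowdict[None]
            Time = rowdict.pop("Time")
            data4 = int(rowdict.pop("data4"))
            # later rows with the same Time and data4 overwrite earlier ones
            new_dict[Time][data4] = rowdict
    return dict(new_dict)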
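It almost does what I need, but it overwrites the previous row's data whenever Time and data4 are the same. I think I need to store data1, data2 and data3 in a list, but I am not sure how to do that.
This is what I would like my dictionary to look like, so that for each timestamp I can group the data by data4 value: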
new_dict = {
    '8/12/2017 8:37:11.719': {
        64: {'data3': '1260.354', 'data1': '4435441.97983871', 'data2': '321106.049167927'},
        60: {'data3': '1260.354', 'data1': '4435451.97715054', 'data2': '321346.085476551'}
    }
}
Answer 0 (score: 0)
I would suggest using the pandas library, since it offers a convenient way to read a CSV file into a DataFrame and group its rows.
import pandas as pd
# read the CSV file
df = pd.read_csv("test.csv")
# group by the desired columns
gb = df.groupby(['Time', 'data4'])
This returns a GroupBy object whose keys are (Time, data4) tuples and whose values are new DataFrames containing the matching rows. You now have three options:
# option 1
list(gb)
This gives you:
[(('8/12/2017 8:37:11.719', 60),
Time data1 data2 data3 data4
1 8/12/2017 8:37:11.719 4.435452e+06 321346.085477 1260.354 60),
(('8/12/2017 8:37:11.719', 64),
Time data1 data2 data3 data4
0 8/12/2017 8:37:11.719 4.435442e+06 321106.049168 1260.354 64
2 8/12/2017 8:37:11.719 4.435462e+06 321096.047655 1260.354 64
3 8/12/2017 8:37:11.719 4.435462e+06 321106.049168 1260.354 64),
(('8/12/2017 8:37:26.919', 56),
Time data1 data2 data3 data4
6 8/12/2017 8:37:26.919 4.436122e+06 324546.569592 1260.354 56),
(('8/12/2017 8:37:26.919', 64),
Time data1 data2 data3 data4
7 8/12/2017 8:37:26.919 4.436122e+06 324646.58472 1260.354 64),
(('8/12/2017 8:37:26.919', 96),
Time data1 data2 data3 data4
4 8/12/2017 8:37:26.919 4.436122e+06 324496.562027 1260.354 96
5 8/12/2017 8:37:26.919 4.436122e+06 324506.563540 1260.354 96)]
You can also build a dictionary, which gives a comparable result:
# option 2
dict(list(gb))
Or you can iterate over the groups and do whatever you need with the rows of each group:
# option 3
result = {}
for name, df_group in gb:
    timestamp, data4 = name
    outer_dict = result.get(timestamp, {})
    inner_dict = df_group.T.to_dict()
    # alternatives:
    #inner_dict = df_group.to_dict(orient="index")
    #inner_dict = df_group.values.tolist()
    outer_dict[data4] = inner_dict
    result[timestamp] = outer_dict
print(result)
This produces the following. You could drop some of the columns, such as the index, the timestamp and data4.
{'8/12/2017 8:37:11.719': {60: {1: {'Time': '8/12/2017 8:37:11.719',
'data1': 4435451.97715054,
'data2': 321346.08547655103,
'data3': 1260.354,
'data4': 60}},
64: {0: {'Time': '8/12/2017 8:37:11.719',
'data1': 4435441.97983871,
'data2': 321106.049167927,
'data3': 1260.354,
'data4': 64},
2: {'Time': '8/12/2017 8:37:11.719',
'data1': 4435461.97446237,
'data2': 321096.047655068,
'data3': 1260.354,
'data4': 64},
3: {'Time': '8/12/2017 8:37:11.719',
'data1': 4435461.97446237,
'data2': 321106.049167927,
'data3': 1260.354,
'data4': 64}}},
'8/12/2017 8:37:26.919': {56: {6: {'Time': '8/12/2017 8:37:26.919',
'data1': 4436121.79704301,
'data2': 324546.569591528,
'data3': 1260.354,
'data4': 56}},
64: {7: {'Time': '8/12/2017 8:37:26.919',
'data1': 4436121.79704301,
'data2': 324646.584720121,
'data3': 1260.354,
'data4': 64}},
96: {4: {'Time': '8/12/2017 8:37:26.919',
'data1': 4436121.79704301,
'data2': 324496.56202723103,
'data3': 1260.354,
'data4': 96},
5: {'Time': '8/12/2017 8:37:26.919',
'data1': 4436121.79704301,
'data2': 324506.56354009104,
'data3': 1260.354,
'data4': 96}}}}
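Dropping the redundant columns, as mentioned above, might look like the sketch below. It assumes pandas is installed; the in-memory `CSV_TEXT` string is a stand-in for `test.csv` and holds only part of the sample data. `to_dict(orient="records")` is used here so that each group becomes a list of row dictionaries without the index.

```python
import io

import pandas as pd

# Stand-in for test.csv: the first timestamp plus one row of the second.
CSV_TEXT = """Time,data1,data2,data3,data4
8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64
8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60
8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64
8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64
8/12/2017 8:37:26.919,4436121.79704301,324496.562027231,1260.354,96
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

result = {}
for (timestamp, data4), df_group in df.groupby(["Time", "data4"]):
    # Drop the key columns and keep the remaining values as a list of row dicts.
    records = df_group.drop(columns=["Time", "data4"]).to_dict(orient="records")
    result.setdefault(timestamp, {})[int(data4)] = records

print(sorted(result["8/12/2017 8:37:11.719"]))   # [60, 64]
print(len(result["8/12/2017 8:37:11.719"][64]))  # 3
```

Each inner value is then a list, so rows that share both Time and data4 sit side by side instead of being keyed by their DataFrame index.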
Hope you get the idea.
Answer 1 (score: 0)
Well, this is a classic use case: grouping.
So the simplest way is to use itertools.groupby to group your dicts by "Time".
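import csv
import itertools
import operator

reader = csv.DictReader(fp, dialect='excel', skipinitialspace=True)
# Caution: DictReader already used the header row for field names, so this skips the first data row
headers = next(reader)
new_dict = {}
for group, records in itertools.groupby(reader, key=operator.itemgetter('Time')):
    new_dict[group] = list(records)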
You get:
{'8/12/2017 8:37:11.719': [{'Time': '8/12/2017 8:37:11.719',
'data1': '4435451.97715054',
'data2': '321346.085476551',
'data3': '1260.354',
'data4': '60'},
{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321096.047655068',
'data3': '1260.354',
'data4': '64'},
{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321106.049167927',
'data3': '1260.354',
'data4': '64'}],
'8/12/2017 8:37:26.919': [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324496.562027231',
'data3': '1260.354',
'data4': '96'},
{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324506.563540091',
'data3': '1260.354',
'data4': '96'},
{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324546.569591528',
'data3': '1260.354',
'data4': '56'},
{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324646.584720121',
'data3': '1260.354',
'data4': '64'}]}
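For reference, the grouping by "Time" could be made self-contained roughly as follows. This is only a sketch: `CSV_TEXT` and `group_by_time` are names made up for illustration, and the sample data is read from an in-memory string instead of a file. `csv.DictReader` takes its field names from the first row on its own when no `fieldnames` argument is given, so this version does not call `next(reader)` and all four rows of the first timestamp are kept.

```python
import csv
import io
import itertools
import operator

# Sample rows copied from the question (first timestamp plus one row of the second).
CSV_TEXT = """Time,data1,data2,data3,data4
8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64
8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60
8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64
8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64
8/12/2017 8:37:26.919,4436121.79704301,324496.562027231,1260.354,96
"""


def group_by_time(fp):
    """Group consecutive CSV rows that share the same Time value."""
    reader = csv.DictReader(fp, dialect="excel", skipinitialspace=True)
    by_key = itertools.groupby(reader, key=operator.itemgetter("Time"))
    return {time: list(rows) for time, rows in by_key}


by_time = group_by_time(io.StringIO(CSV_TEXT))
print({time: len(rows) for time, rows in by_time.items()})
# {'8/12/2017 8:37:11.719': 4, '8/12/2017 8:37:26.919': 1}
```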
You can also use a dictionary comprehension:
new_dict = {group: list(records)
            for group, records in itertools.groupby(reader, key=operator.itemgetter('Time'))}
If you need to group by both "Time" and "data4", change the grouping key:
for group, records in itertools.groupby(reader, key=lambda v: (v["Time"], int(v["data4"]))):
    new_dict[group] = list(records)
The result is:
{('8/12/2017 8:37:11.719', 60): [{'Time': '8/12/2017 8:37:11.719',
'data1': '4435451.97715054',
'data2': '321346.085476551',
'data3': '1260.354',
'data4': '60'}],
('8/12/2017 8:37:11.719', 64): [{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321096.047655068',
'data3': '1260.354',
'data4': '64'},
{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321106.049167927',
'data3': '1260.354',
'data4': '64'}],
('8/12/2017 8:37:26.919', 56): [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324546.569591528',
'data3': '1260.354',
'data4': '56'}],
('8/12/2017 8:37:26.919', 64): [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324646.584720121',
'data3': '1260.354',
'data4': '64'}],
('8/12/2017 8:37:26.919', 96): [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324496.562027231',
'data3': '1260.354',
'data4': '96'},
{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324506.563540091',
'data3': '1260.354',
'data4': '96'}]}
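One caveat worth keeping in mind: `itertools.groupby` only merges rows that are adjacent in the input. If all four rows of the first timestamp in the question's file are read, their data4 values arrive in the order 64, 60, 64, 64, so the key for 64 would form two separate runs and the second run would overwrite the first in the dictionary. A possible workaround, sketched below on made-up rows (the `rows` list and the `key` function are illustrative only), is to sort by the same key before grouping.

```python
import itertools

# Hypothetical, shortened rows: the data4 values arrive in the order 64, 60, 64.
rows = [
    {"Time": "t1", "data1": "a", "data4": "64"},
    {"Time": "t1", "data1": "b", "data4": "60"},
    {"Time": "t1", "data1": "c", "data4": "64"},
]


def key(row):
    """Composite grouping key: (Time, data4 as int)."""
    return (row["Time"], int(row["data4"]))


# Without sorting, ("t1", 64) forms two runs and the later run overwrites the earlier one.
unsorted_groups = {k: list(g) for k, g in itertools.groupby(rows, key=key)}

# Sorting by the grouping key first makes rows with equal keys adjacent.
sorted_groups = {k: list(g) for k, g in itertools.groupby(sorted(rows, key=key), key=key)}

print(len(unsorted_groups[("t1", 64)]))  # 1
print(len(sorted_groups[("t1", 64)]))    # 2
```

Sorting loads all rows into memory, which should be fine for small files but is worth considering for large ones.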
If you need two levels of grouping, first by "Time" and then by "data4", you need two loops:
new_dict = {}
for group1, records1 in itertools.groupby(reader, key=operator.itemgetter("Time")):
    new_dict[group1] = {}
    for group2, records2 in itertools.groupby(records1, key=lambda v: int(v["data4"])):
        new_dict[group1][group2] = list(records2)
Result:
{'8/12/2017 8:37:11.719': {60: [{'Time': '8/12/2017 8:37:11.719',
'data1': '4435451.97715054',
'data2': '321346.085476551',
'data3': '1260.354',
'data4': '60'}],
64: [{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321096.047655068',
'data3': '1260.354',
'data4': '64'},
{'Time': '8/12/2017 8:37:11.719',
'data1': '4435461.97446237',
'data2': '321106.049167927',
'data3': '1260.354',
'data4': '64'}]},
'8/12/2017 8:37:26.919': {56: [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324546.569591528',
'data3': '1260.354',
'data4': '56'}],
64: [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324646.584720121',
'data3': '1260.354',
'data4': '64'}],
96: [{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324496.562027231',
'data3': '1260.354',
'data4': '96'},
{'Time': '8/12/2017 8:37:26.919',
'data1': '4436121.79704301',
'data2': '324506.563540091',
'data3': '1260.354',
'data4': '96'}]}}
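If the goal is the structure asked for in the question (Time, then data4, then a list holding the remaining columns), a plain loop with nested `defaultdict`s is one alternative that does not depend on row order. The following is a minimal sketch, not a tested drop-in replacement: `build_nested` and `CSV_TEXT` are illustrative names, and the data is read from a string for the sake of the example.

```python
import csv
import io
from collections import defaultdict

# The four rows of the first timestamp from the question.
CSV_TEXT = """Time,data1,data2,data3,data4
8/12/2017 8:37:11.719,4435441.97983871,321106.049167927,1260.354,64
8/12/2017 8:37:11.719,4435451.97715054,321346.085476551,1260.354,60
8/12/2017 8:37:11.719,4435461.97446237,321096.047655068,1260.354,64
8/12/2017 8:37:11.719,4435461.97446237,321106.049167927,1260.354,64
"""


def build_nested(fp):
    """Return {Time: {data4: [remaining columns of each matching row]}}."""
    nested = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(fp, skipinitialspace=True):
        time = row.pop("Time")
        data4 = int(row.pop("data4"))
        # Append instead of assigning, so rows sharing both keys are all kept.
        nested[time][data4].append(row)
    # Convert to plain dicts so the result prints and compares like a normal dict.
    return {time: dict(groups) for time, groups in nested.items()}


result = build_nested(io.StringIO(CSV_TEXT))
print(sorted(result["8/12/2017 8:37:11.719"]))   # [60, 64]
print(len(result["8/12/2017 8:37:11.719"][64]))  # 3
```

Because rows are appended as they are read, no sorting is needed and the file is traversed only once.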