My data looks like this:
date, cola, colb, colc
1,10,,
2,11,,
3,12,,
4,13,,
1,,14,
2,,15,
3,,16,
4,,17,
1,,,17
2,,,18
3,,,19
4,13,,20
I want to merge the rows on the first column, so that the output looks like this:
date, cola, colb, colc
1,10,14,17
2,11,15,18
3,12,16,19
4,13,17,20
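I can't guarantee that there won't be conflicts (more than one value for the same date and column), so I'd like to be able to choose either the maximum or the mean.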
Answer 0 (score: 1)
You can use groupby. Starting from a csv with duplicates:
>>> !cat tomerge.csv
date, cola, colb, colc
1,10,,
2,11,,
1,,14,
2,,15,
1,,24,
2,,40,
1,,,17
2,,,18
Read it in:
>>> import pandas as pd
>>> df = pd.read_csv("tomerge.csv")
>>> df
   date  cola  colb  colc
0     1    10   NaN   NaN
1     2    11   NaN   NaN
2     1   NaN    14   NaN
3     2   NaN    15   NaN
4     1   NaN    24   NaN
5     2   NaN    40   NaN
6     1   NaN   NaN    17
7     2   NaN   NaN    18
And then the magic happens:
>>> df.groupby("date").mean()
      cola  colb  colc
date
1       10  19.0    17
2       11  27.5    18
>>> df.groupby("date").max()
      cola  colb  colc
date
1       10    24    17
2       11    40    18
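
For completeness, here is a minimal sketch of the same groupby approach applied to the data from the question, written as a standalone script. A few things in it are my assumptions rather than part of the original post: the data is inlined through io.StringIO instead of being read from a file, skipinitialspace=True is passed so the header yields "cola" rather than " cola", and the names RAW, merged_max, merged_mean and merged_mixed are just illustrative. The per-column agg call at the end is one possible way to pick max for some columns and mean for others.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
RAW = """date, cola, colb, colc
1,10,,
2,11,,
3,12,,
4,13,,
1,,14,
2,,15,
3,,16,
4,,17,
1,,,17
2,,,18
3,,,19
4,13,,20
"""

# Assumption: skipinitialspace=True strips the blank after each comma in the
# header, so the columns are named "cola", "colb", "colc".
df = pd.read_csv(io.StringIO(RAW), skipinitialspace=True)

# Collapse to one row per date. Missing cells (NaN) are skipped by both
# aggregations, so each column keeps the values that are actually present.
merged_max = df.groupby("date").max().reset_index()
merged_mean = df.groupby("date").mean().reset_index()

# Hypothetical per-column choice: max for cola, mean for colb and colc.
merged_mixed = (
    df.groupby("date")
    .agg({"cola": "max", "colb": "mean", "colc": "mean"})
    .reset_index()
)

print(merged_max)
```

reset_index() turns date back into an ordinary column, which matches the layout asked for in the question. Because the columns contain missing values on read, pandas stores them as floats; cast them afterwards if integers are needed.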