Consider a DataFrame like this:
coordinates metric year
[55.2274742137, 25.1560686018] met_1 2014
[55.1554330879, 25.0986809174] met_2 2015
[55.1554330879, 25.0986809174] met_2 2016
[55.14353879, 25.44] met_221212 2020
[55.11239959, 25.3232] met_2132 2022
Desired result:
coordinates metric year
[55.2274742137, 25.1560686018] met_1 2014
[55.1554330879, 25.0986809174] met_2 [2015,2016]
[55.14353879, 25.44] met_221212 2020
[55.11239959, 25.3232] met_2132 2022
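I want to find the records that are duplicated on the coordinates and metric columns. For those records, the year values should be collected into a list and stored as the new year column. After that, I want to drop the duplicate rows.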
Answer 0 (score: 1)
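You can use groupby, but if a key column contains lists it fails with:
TypeError: unhashable type: 'list'
The solution is to convert the lists to hashable tuples first.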
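As a minimal sketch of why the conversion matters, the snippet below checks hashability directly with the built-in hash() and then groups on the tuple version of the column. The two-row frame and the names list_is_hashable and group_sizes are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Illustrative two-row frame: both rows share the same coordinates (stored as lists).
df = pd.DataFrame({
    "coordinates": [[55.1554330879, 25.0986809174], [55.1554330879, 25.0986809174]],
    "metric": ["met_2", "met_2"],
    "year": [2015, 2016],
})

# Lists are mutable and therefore unhashable; hash() raises TypeError for them.
try:
    hash(df["coordinates"].iloc[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples of floats are hashable, so the converted column can serve as a groupby key.
df["coordinates"] = df["coordinates"].apply(tuple)
group_sizes = df.groupby(["coordinates", "metric"]).size()

print(list_is_hashable)      # False
print(group_sizes.iloc[0])   # 2
```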
Another issue is that a list is needed only when a group holds more than 1 value, so the function passed to apply needs a conditional expression:
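# lists are unhashable, so convert them to tuples before grouping
df.coordinates = df.coordinates.apply(tuple)
# keep a scalar for single-year groups, build a list otherwise
df = (df.groupby(['coordinates', 'metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.item()))
df = df.reset_index()
# restore the original list representation
df.coordinates = df.coordinates.apply(list)
print(df)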
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 [2015, 2016]
2 [55.14353879, 25.44] met_221212 2020
3 [55.11239959, 25.3232] met_2132 2022
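For anyone who wants to run the approach above end to end, here is a self-contained sketch built on the sample data from the question. The helper name collapse_years is hypothetical, and x.iloc[0] is used in place of x.item() purely as a stylistic choice; both return the single value of a one-element group.

```python
import pandas as pd

def collapse_years(frame):
    """Group on (coordinates, metric); duplicated groups get a list of years,
    single-row groups keep their scalar year. (Hypothetical helper name.)"""
    out = frame.copy()
    # Lists cannot be groupby keys, so switch to tuples temporarily.
    out["coordinates"] = out["coordinates"].apply(tuple)
    years = (
        out.groupby(["coordinates", "metric"], sort=False)["year"]
        .apply(lambda x: list(x) if len(x) > 1 else x.iloc[0])
    )
    out = years.reset_index()
    # Convert the keys back to lists to match the input layout.
    out["coordinates"] = out["coordinates"].apply(list)
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
        [55.14353879, 25.44],
        [55.11239959, 25.3232],
    ],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})

result = collapse_years(df)
print(result)
```

Because the year column then mixes scalars and lists, it has object dtype, which is worth keeping in mind for any later numeric work on it.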
If lists are acceptable for all values in the output column:
df.coordinates = df.coordinates.apply(tuple)
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list)
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 [2014]
1 [55.1554330879, 25.0986809174] met_2 [2015, 2016]
2 [55.14353879, 25.44] met_221212 [2020]
3 [55.11239959, 25.3232] met_2132 [2022]
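One possible reason to prefer the all-lists variant is that every cell then has the same type, so follow-up steps need no special casing. The sketch below illustrates this on a three-row subset of the question's data; the n_years and first_year columns are made-up names for the illustration.

```python
import pandas as pd

# Three-row subset of the question's data (illustrative).
df = pd.DataFrame({
    "coordinates": [
        [55.2274742137, 25.1560686018],
        [55.1554330879, 25.0986809174],
        [55.1554330879, 25.0986809174],
    ],
    "metric": ["met_1", "met_2", "met_2"],
    "year": [2014, 2015, 2016],
})

df["coordinates"] = df["coordinates"].apply(tuple)
grouped = (
    df.groupby(["coordinates", "metric"], sort=False)["year"]
    .apply(list)
    .reset_index()
)
grouped["coordinates"] = grouped["coordinates"].apply(list)

# Every cell in "year" is a list, so the same function works for all rows.
grouped["n_years"] = grouped["year"].apply(len)
grouped["first_year"] = grouped["year"].apply(min)
print(grouped)
```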
If string output is needed:
df.coordinates = df.coordinates.apply(tuple)
df = (df.groupby(['coordinates', 'metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 2015,2016
2 [55.14353879, 25.44] met_221212 2020
3 [55.11239959, 25.3232] met_2132 2022
Answer 1 (score: 0)
You can use groupby to help here:
import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014],
                   [[55.1554330879, 25.0986809174], "met_2", 2015],
                   [[55.1554330879, 25.0986809174], "met_2", 2015]],
                  columns=["coordinates", "metric", "year"])
print(df)
coordinates metric year
0 [55.2274742137, 25.1560686018] met_1 2014
1 [55.1554330879, 25.0986809174] met_2 2015
2 [55.1554330879, 25.0986809174] met_2 2015
# define apply function
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]

# groupby needs hashable items, that's why we convert to tuple before
df["coordinates"] = df["coordinates"].apply(tuple)
# groupby and apply aggregator
print(df.groupby(["coordinates", "metric"]).apply(aggregate))
coordinates metric
(55.1554330879, 25.0986809174) met_2 [2015, 2015]
(55.2274742137, 25.1560686018) met_1 2014
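The result above is a Series indexed by the group keys, sorted by key, with NumPy arrays for the multi-year groups. A possible way to bring it back to the layout asked for in the question is sketched below. It is a variation under stated assumptions rather than this answer's exact code: the years 2015 and 2016 are taken from the question, the year column is selected before apply so the helper receives a Series, and sort=False is added to keep the input order.

```python
import pandas as pd

# Dummy data, with the question's years (2015 and 2016) for the duplicated rows.
df = pd.DataFrame(
    [[[55.2274742137, 25.1560686018], "met_1", 2014],
     [[55.1554330879, 25.0986809174], "met_2", 2015],
     [[55.1554330879, 25.0986809174], "met_2", 2016]],
    columns=["coordinates", "metric", "year"],
)

def aggregate(years):
    # tolist() yields plain Python ints instead of a NumPy array.
    values = years.tolist()
    return values if len(values) > 1 else values[0]

# groupby needs hashable keys, so convert the lists to tuples first.
df["coordinates"] = df["coordinates"].apply(tuple)

result = (
    df.groupby(["coordinates", "metric"], sort=False)["year"]
    .apply(aggregate)
    .reset_index()   # turn the group keys back into ordinary columns
)
# Restore the list representation of the coordinates.
result["coordinates"] = result["coordinates"].apply(list)
print(result)
```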