Pandas: unique condition, append column strings

Date: 2017-03-01 13:29:11

Tags: python pandas

Consider a DataFrame like this:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       2015
[55.1554330879, 25.0986809174]  met_2       2016
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

Desired result:

coordinates                     metric      year
[55.2274742137, 25.1560686018]  met_1       2014
[55.1554330879, 25.0986809174]  met_2       [2015,2016]
[55.14353879, 25.44]            met_221212  2020
[55.11239959, 25.3232]          met_2132    2022

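I want to find the records that are duplicated on the coordinates and metric columns. Once they are found, the year values of each duplicated group should be collected into a list and used as the new year column. After that, I want to drop the duplicate rows.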

2 answers:

Answer 0 (score: 1)

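You need groupby with apply.

But if the grouping column contains lists, groupby raises:

TypeError: unhashable type: 'list'

The solution is to convert the lists to hashable tuples first.

Another problem is that a list is wanted only when a group holds more than one value, so the lambda needs a conditional expression: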

df.coordinates = df.coordinates.apply(tuple)
# parentheses let the method chain continue on the next line
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: list(x) if len(x) > 1 else x.item()))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1          2014
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212          2020
3          [55.11239959, 25.3232]    met_2132          2022
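
The unhashable-list error mentioned above can be reproduced outside of groupby. The following is a minimal sketch, assuming a three-row subset of the question's data; the names `list_is_hashable`, `as_tuples`, and `counts` are introduced here for illustration only.

```python
import pandas as pd

# Three rows taken from the question's frame (subset chosen for illustration)
df = pd.DataFrame({
    "coordinates": [[55.2274742137, 25.1560686018],
                    [55.1554330879, 25.0986809174],
                    [55.1554330879, 25.0986809174]],
    "metric": ["met_1", "met_2", "met_2"],
    "year": [2014, 2015, 2016],
})

# Lists are mutable, so Python refuses to hash them
try:
    hash(df["coordinates"].iloc[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples of floats are hashable, so they can serve as group keys
as_tuples = df["coordinates"].apply(tuple)
tuple_hash = hash(as_tuples.iloc[0])

# With tuple keys, counting rows per (coordinates, metric) pair works
counts = (df.assign(coordinates=as_tuples)
            .groupby(["coordinates", "metric"], sort=False)
            .size())
print(list_is_hashable)
print(counts)
```

Groups with a count above 1 are the duplicated (coordinates, metric) pairs whose years need to be merged.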

If lists are acceptable for every value in the output column:

df.coordinates = df.coordinates.apply(tuple)
df = df.groupby(['coordinates','metric'], sort=False)['year'].apply(list)
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
                      coordinates      metric          year
0  [55.2274742137, 25.1560686018]       met_1        [2014]
1  [55.1554330879, 25.0986809174]       met_2  [2015, 2016]
2            [55.14353879, 25.44]  met_221212        [2020]
3          [55.11239959, 25.3232]    met_2132        [2022]

If the output should be strings:

df.coordinates = df.coordinates.apply(tuple)
# parentheses let the method chain continue on the next line
df = (df.groupby(['coordinates','metric'], sort=False)['year']
        .apply(lambda x: ','.join(x.astype(str))))
df = df.reset_index()
df.coordinates = df.coordinates.apply(list)
print (df)
                      coordinates      metric       year
0  [55.2274742137, 25.1560686018]       met_1       2014
1  [55.1554330879, 25.0986809174]       met_2  2015,2016
2            [55.14353879, 25.44]  met_221212       2020
3          [55.11239959, 25.3232]    met_2132       2022
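
The snippets in this answer assume that `df` already exists. As a self-contained sketch, the question's frame can be built by hand and the tuple round trip wrapped in a helper. The function name `collapse_years` is hypothetical, and this version collects lists first and unwraps single-element lists afterwards, which is a small variation on the lambda used above rather than the answer's exact code.

```python
import pandas as pd

# The question's data, typed in by hand
df = pd.DataFrame({
    "coordinates": [[55.2274742137, 25.1560686018],
                    [55.1554330879, 25.0986809174],
                    [55.1554330879, 25.0986809174],
                    [55.14353879, 25.44],
                    [55.11239959, 25.3232]],
    "metric": ["met_1", "met_2", "met_2", "met_221212", "met_2132"],
    "year": [2014, 2015, 2016, 2020, 2022],
})

def collapse_years(frame):
    """Merge rows sharing coordinates and metric; gather their years."""
    out = frame.copy()
    # Lists cannot be group keys, so switch to tuples for the groupby
    out["coordinates"] = out["coordinates"].apply(tuple)
    out = (out.groupby(["coordinates", "metric"], sort=False)["year"]
              .apply(list)
              .reset_index())
    # Keep a list only where a group had more than one year
    out["year"] = out["year"].apply(lambda ys: ys if len(ys) > 1 else ys[0])
    # Restore the original list representation
    out["coordinates"] = out["coordinates"].apply(list)
    return out

result = collapse_years(df)
print(result)
```

Copying the frame inside the helper leaves the caller's `df` untouched, so the original list-valued coordinates column is still available afterwards.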

Answer 1 (score: 0)

You can use groupby as a helper here:

import pandas as pd

# dummy data
df = pd.DataFrame([[[55.2274742137, 25.1560686018], "met_1", 2014], 
                  [[55.1554330879, 25.0986809174], "met_2", 2015], 
                  [[55.1554330879, 25.0986809174], "met_2", 2015]],
                  columns=["coordinates", "metric", "year"])

print(df)
    coordinates                     metric  year
0   [55.2274742137, 25.1560686018]  met_1   2014
1   [55.1554330879, 25.0986809174]  met_2   2015
2   [55.1554330879, 25.0986809174]  met_2   2015

# define apply function
def aggregate(sub_df):
    years = sub_df["year"].values
    if len(years) > 1:
        return years
    else:
        return years[0]

# groupby needs hashable items, that's why we convert to tuple before
df["coordinates"] = df["coordinates"].apply(tuple)

# groupby and apply aggregator
print(df.groupby(["coordinates", "metric"]).apply(aggregate))

coordinates                     metric
(55.1554330879, 25.0986809174)  met_2     [2015, 2015]
(55.2274742137, 25.1560686018)  met_1            2014
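
A possible variation on the approach above, offered as a sketch rather than as part of the original answer: selecting the year column before calling apply means the function receives only that Series, and `tolist()` returns plain Python lists instead of NumPy arrays. Recent pandas releases may also warn when `DataFrameGroupBy.apply` operates on the grouping columns, which this form avoids. The name `aggregate_years` is hypothetical, and the dummy data is entered with tuples directly.

```python
import pandas as pd

# Same dummy data as above, with coordinates already stored as tuples
df = pd.DataFrame({
    "coordinates": [(55.2274742137, 25.1560686018),
                    (55.1554330879, 25.0986809174),
                    (55.1554330879, 25.0986809174)],
    "metric": ["met_1", "met_2", "met_2"],
    "year": [2014, 2015, 2015],
})

def aggregate_years(years):
    """Return a list for multi-row groups, a scalar otherwise."""
    values = years.tolist()
    return values if len(values) > 1 else values[0]

# Select the year column first so apply sees only that Series
collapsed = df.groupby(["coordinates", "metric"])["year"].apply(aggregate_years)

# reset_index turns the group keys back into ordinary columns
flat = collapsed.reset_index()
print(flat)
```

Because the default `sort=True` is kept here, the groups come back ordered by key, as in the output printed above; passing `sort=False` would preserve the order of first appearance instead.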