Merge rows in a pandas DataFrame based on multiple values

Time: 2021-01-13 14:14:37

Tags: python pandas merge pandas-groupby

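This is essentially related to "Merge values of a dataframe where other columns match", but since that question has already been answered and I could not work out the right modification for this different problem, I am opening a new thread. I hope that is OK. On to the question: I have the following data: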

 date              car_brand    color     city      stolen
 "2020-01-01"      porsche      red       paris     False
 "2020-01-01"      porsche      red       london    False
 "2020-01-01"      porsche      red       munich    False
 "2020-01-01"      porsche      red       madrid    False
 "2020-01-01"      porsche      red       rome      False
 "2020-01-01"      porsche      blue      berlin    False 
 "2020-01-01"      porsche      blue      tokyo     False
 "2020-01-01"      porsche      blue      peking    False
 "2020-01-01"      porsche      white     liverpool False 
 "2020-01-01"      porsche      white     oslo      False
 "2020-01-01"      porsche      white     barcelona False
 "2020-01-01"      porsche      white     miami     False
 "2020-01-02"      porsche      red       paris     False
 "2020-01-02"      porsche      red       london    False
 "2020-01-02"      porsche      red       munich    False
 "2020-01-02"      porsche      red       madrid    False
 "2020-01-02"      porsche      red       rome      False
 "2020-01-02"      porsche      blue      berlin    False
 "2020-01-02"      porsche      blue      tokyo     False
 "2020-01-02"      porsche      blue      peking    False
 "2020-01-02"      porsche      white     liverpool False 
 "2020-01-02"      porsche      white     oslo      False
 "2020-01-02"      porsche      white     barcelona False
 "2020-01-02"      porsche      white     miami     False 
 "2020-01-03"      porsche      red       paris     False
 "2020-01-03"      porsche      red       london    False
 "2020-01-03"      porsche      red       munich    False
 "2020-01-03"      porsche      red       madrid    True
 "2020-01-03"      porsche      red       rome      False
 "2020-01-03"      porsche      blue      berlin    False
 "2020-01-03"      porsche      blue      tokyo     False
 "2020-01-03"      porsche      blue      peking    False
 "2020-01-03"      porsche      white     liverpool False 
 "2020-01-03"      porsche      white     oslo      False
 "2020-01-03"      porsche      white     barcelona False 
 "2020-01-03"      porsche      white     miami     False 
 "2020-01-04"      porsche      red       paris     False
 "2020-01-04"      porsche      red       london    False
 "2020-01-04"      porsche      red       munich    False
 "2020-01-04"      porsche      red       madrid    False
 "2020-01-04"      porsche      red       rome      False 
 "2020-01-04"      porsche      blue      berlin    False
 "2020-01-04"      porsche      blue      tokyo     False
 "2020-01-04"      porsche      blue      peking    False 
 "2020-01-04"      porsche      white     liverpool False
 "2020-01-04"      porsche      white     oslo      False
 "2020-01-04"      porsche      white     barcelona False
 "2020-01-04"      porsche      white     miami     False

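I now want to build a data frame as follows: if the boolean "stolen" values match for all entries across consecutive days, I want to merge the date column. In the example above, the boolean entries match for "2020-01-01" and "2020-01-02". So overall I want to get the following result: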

 date                             car_brand    color     city      stolen
 ["2020-01-01","2020-01-02"]      porsche      red       paris     False
 ["2020-01-01","2020-01-02"]      porsche      red       london    False
 ["2020-01-01","2020-01-02"]      porsche      red       munich    False
 ["2020-01-01","2020-01-02"]      porsche      red       madrid    False
 ["2020-01-01","2020-01-02"]      porsche      red       rome      False
 ["2020-01-01","2020-01-02"]      porsche      blue      berlin    False 
 ["2020-01-01","2020-01-02"]      porsche      blue      tokyo     False
 ["2020-01-01","2020-01-02"]      porsche      blue      peking    False
 ["2020-01-01","2020-01-02"]      porsche      white     liverpool False 
 ["2020-01-01","2020-01-02"]      porsche      white     oslo      False
 ["2020-01-01","2020-01-02"]      porsche      white     barcelona False
 ["2020-01-01","2020-01-02"]      porsche      white     miami     False
 ["2020-01-03"]                   porsche      red       paris     False
 ["2020-01-03"]                   porsche      red       london    False
 ["2020-01-03"]                   porsche      red       munich    False
 ["2020-01-03"]                   porsche      red       madrid    True
 ["2020-01-03"]                   porsche      red       rome      False
 ["2020-01-03"]                   porsche      blue      berlin    False
 ["2020-01-03"]                   porsche      blue      tokyo     False
 ["2020-01-03"]                   porsche      blue      peking    False
 ["2020-01-03"]                   porsche      white     liverpool False 
 ["2020-01-03"]                   porsche      white     oslo      False
 ["2020-01-03"]                   porsche      white     barcelona False 
 ["2020-01-03"]                   porsche      white     miami     False 
 ["2020-01-04"]                   porsche      red       paris     False
 ["2020-01-04"]                   porsche      red       london    False
 ["2020-01-04"]                   porsche      red       munich    False
 ["2020-01-04"]                   porsche      red       madrid    False
 ["2020-01-04"]                   porsche      red       rome      False 
 ["2020-01-04"]                   porsche      blue      berlin    False
 ["2020-01-04"]                   porsche      blue      tokyo     False
 ["2020-01-04"]                   porsche      blue      peking    False 
 ["2020-01-04"]                   porsche      white     liverpool False
 ["2020-01-04"]                   porsche      white     oslo      False
 ["2020-01-04"]                   porsche      white     barcelona False
 ["2020-01-04"]                   porsche      white     miami     False

1 answer:

Answer 0 (score: 1)

For brevity, the code does not build the data frame from the sample data.
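Since the answer's code assumes that df already exists, here is a minimal sketch of one way to rebuild the question's sample table so the snippet can be run. The helper names cities_by_color, dates and rows are made up for this illustration; the values are copied from the question.

```python
import pandas as pd

# Cities per color, copied from the question's sample table.
cities_by_color = {
    "red": ["paris", "london", "munich", "madrid", "rome"],
    "blue": ["berlin", "tokyo", "peking"],
    "white": ["liverpool", "oslo", "barcelona", "miami"],
}
dates = ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]

# One row per (date, color, city); the only stolen entry in the sample
# is madrid on 2020-01-03.
rows = [
    {
        "date": d,
        "car_brand": "porsche",
        "color": color,
        "city": city,
        "stolen": d == "2020-01-03" and city == "madrid",
    }
    for d in dates
    for color, cities in cities_by_color.items()
    for city in cities
]
df = pd.DataFrame(rows)
```

The date column is left as strings here, because the answer's code converts it with pd.to_datetime as its first step.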

The key technique is a new column that increments whenever stolen changes from one date to the next (the "increment on value change" pattern).
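To see the pattern in isolation, here is a small sketch on a toy Series. The values in flags are hypothetical per-date "any car stolen" flags that mirror the question's four days. This sketch uses shift() and ne() instead of the diff() call in the answer's code; the idea is the same.

```python
import pandas as pd

# Hypothetical per-date flags: True if any car was stolen on that date.
flags = pd.Series(
    [False, False, True, False],
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]
    ),
)

# Compare each value with its predecessor; the first row has no
# predecessor, so it always counts as a change.
changed = flags.ne(flags.shift())

# A running total of the changes yields one id per run of equal values.
group_id = changed.cumsum()
print(group_id.tolist())  # [1, 1, 2, 3]
```

The first two dates share an id, so they can later be grouped and merged, while the day with a theft and the day after it each get their own id.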

import pandas as pd

df["date"] = pd.to_datetime(df["date"])

# start a new group whenever the per-date "any car stolen" flag changes
df2 = (df.groupby("date")["stolen"].max().to_frame()
 .reset_index()
 .assign(stolen_grp=lambda dfa: (dfa.stolen.diff() != 0).cumsum())
 .drop(columns="stolen")
)

# put stolen_grp back into dataframe
df = df.merge(df2, on="date")

# same technique, breaking on days a car has been stolen
(
    df
    .groupby([c for c in df.columns if c!="date"])["date"]
    # only include if first date or if it's a consecutive date
    .agg(lambda x: [xx for i,xx in enumerate(x) if i==0 or xx==(list(x)[i-1]+pd.DateOffset(1))])
    .reset_index()
    .drop(columns="stolen_grp")
)

Sample output

car_brand color   city  stolen                                       date
  porsche  blue berlin   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue berlin   False                      [2020-01-03 00:00:00]
  porsche  blue berlin   False                      [2020-01-04 00:00:00]
  porsche  blue peking   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue peking   False                      [2020-01-03 00:00:00]
  porsche  blue peking   False                      [2020-01-04 00:00:00]
  porsche  blue  tokyo   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]
  porsche  blue  tokyo   False                      [2020-01-03 00:00:00]
  porsche  blue  tokyo   False                      [2020-01-04 00:00:00]
  porsche   red london   False [2020-01-01 00:00:00, 2020-01-02 00:00:00]