Although I can create a pivot table in Excel, I would rather use Python to combine rows that share similar fields. In this case, a row should only be combined with rows that have the same "id" and "location".
Input:
id location date code
111 Park 1/1/2018 7765
143 School 2/5/2018 3345
111 Beach 1/1/2018 7534
223 Library 3/5/2018 3345
Output 1:
id location date code
111 Park, Beach 1/1/2018 7765, 7534
143 School 2/5/2018 3345
223 Library 3/5/2018 3345
Output 2:
id location1 location2 date code1 code2
111 Park Beach 1/1/2018 7765 7534
143 School 2/5/2018 3345
223 Library 3/5/2018 3345
The only reason I want to see queries for both outputs is that I have several other columns containing the definitions of these codes. I know I should use groupby on id and location; however, I am struggling with producing output 1 and output 2 and with creating the new rows.
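For reference, the sample input above can be rebuilt as a pandas DataFrame named `df`, which is what the answers below operate on (a minimal setup, assuming pandas):

```python
import pandas as pd

# Rebuild the sample input from the question.
df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
print(df)
```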
Answer 0 (score: 1)
IIUC (if I understand correctly):
mapper = lambda x : ",".join(x)
df["code"] = df["code"].astype(str)
df.groupby("id").agg({"location" : mapper, "code" : mapper})
location code
id
111 Park,Beach 7765,7534
143 School 3345
223 Library 3345
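Note that this aggregation drops the date column. If you also want to keep one date per group, as in the question's expected output, one option (a sketch, using the same sample data) is to add `"date": "first"` to the same agg call:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
df["code"] = df["code"].astype(str)

# Same groupby as above, but also keep the first date seen in each group.
out = df.groupby("id", as_index=False).agg(
    {"location": ",".join, "code": ",".join, "date": "first"}
)
print(out)
```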
Answer 1 (score: 1)
Try this:
df['date'] = pd.to_datetime(df['date'])
df['code'] = df['code'].astype(str)
df = df.groupby(by=['id', 'date'], as_index=False).agg({'location': ','.join, 'code': ','.join})
print(df)
id date location code
0 111 2018-01-01 Park,Beach 7765,7534
1 143 2018-02-05 School 3345
2 223 2018-03-05 Library 3345
Answer 2 (score: 1)
You can try the following approach:
Join the necessary columns in the dataframe:
df["code"] = df["code"].astype(str)
output1 = df.groupby("id").agg({"location": ",".join,"code":",".join,"date":'first'}).reset_index()
With this approach, if person A went to School twice on the same day, the output keeps the unique value School rather than showing School, School. On the other hand, if person A went to School twice on the same day but with two different codes, it produces School,School:
df["code"] = df["code"].astype(str)
output1 = df.groupby(["id", "date"]).agg({"location": list, "code": list}).reset_index()
## rows where location and code have the same number of unique values;
## a `set` operation is enough here to keep only the unique elements
unique_values = output1[output1["location"].apply(set).apply(len) == output1["code"].apply(set).apply(len)].copy()
## rows where location and code have different numbers of unique values
## (e.g. the same location with two different codes); no `set` operation needed
other_values = output1[output1["location"].apply(set).apply(len) != output1["code"].apply(set).apply(len)].copy()
## collapse the lists/sets into comma-separated strings
## (note: joining a set does not guarantee element order)
unique_values["location"] = unique_values["location"].apply(set).apply(",".join)
unique_values["code"] = unique_values["code"].apply(set).apply(",".join)
other_values["location"] = other_values["location"].apply(",".join)
other_values["code"] = other_values["code"].apply(",".join)
## concatenate both dataframes back together
output1 = pd.concat([unique_values, other_values]).sort_index()
This produces the output1 dataframe. The following code expands the location and code columns of that dataframe into separate numbered columns:
output2 = output1["location"].str.split(pat=",",expand=True)
output2.columns = ["location_"+ str(i) for i in output2.columns]
output3 = output1["code"].str.split(pat=",",expand=True)
output3.columns = ["code"+ str(i) for i in output3.columns]
final_output = pd.concat([output1, output2, output3],axis=1)
final_output = final_output.fillna('')
This gives the final expanded output.
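The split/expand step can be checked end to end on the question's sample data. A minimal, self-contained version (the intermediate unique-value handling is omitted here, since the sample data has no duplicate visits):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
df["code"] = df["code"].astype(str)

# Join locations/codes per (id, date); row order within each group is preserved.
output1 = df.groupby(["id", "date"], as_index=False).agg(
    {"location": ",".join, "code": ",".join}
)

# Expand the comma-separated columns into numbered columns.
loc = output1["location"].str.split(",", expand=True)
loc.columns = ["location_" + str(i) for i in loc.columns]
codes = output1["code"].str.split(",", expand=True)
codes.columns = ["code" + str(i) for i in codes.columns]

final_output = pd.concat([output1, loc, codes], axis=1).fillna("")
print(final_output)
```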
Answer 3 (score: 1)
For case 1, use DataFrame.groupby on id and date, then aggregate the location and code columns with ', '.join:
df1 = df.astype({'code': 'str'}).groupby(['id', 'date']).agg(', '.join).reset_index()
For case 2, use DataFrame.melt, then groupby on id and variable and use cumcount to append a sequential counter to the variable column; finally use set_index, unstack and droplevel:
df2 = df.melt(id_vars=['id', 'date'])
df2['variable'] += df2.groupby(['id', 'variable']).cumcount().add(1).astype(str)
df2 = df2.set_index(['id', 'date', 'variable']).unstack().droplevel(0, 1).reset_index()
Result:
# CASE 1: print(df1)
id date location code
0 111 1/1/2018 Park, Beach 7765, 7534
1 143 2/5/2018 School 3345
2 223 3/5/2018 Library 3345
# CASE 2: print(df2)
variable id date code1 code2 location1 location2
0 111 1/1/2018 7765 7534 Park Beach
1 143 2/5/2018 3345 NaN School NaN
2 223 3/5/2018 3345 NaN Library NaN
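Both cases from this answer can be run end to end on the sample data. A self-contained sketch (code is cast to string up front in both paths, and `axis=1` is written out explicitly in `droplevel`):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})

# Case 1: join locations/codes per (id, date).
df1 = (df.astype({"code": "str"})
         .groupby(["id", "date"])
         .agg(", ".join)
         .reset_index())

# Case 2: melt to long form, number repeated variables
# (location1, location2, code1, ...), then pivot back to wide form.
df2 = df.astype({"code": "str"}).melt(id_vars=["id", "date"])
df2["variable"] += df2.groupby(["id", "variable"]).cumcount().add(1).astype(str)
df2 = (df2.set_index(["id", "date", "variable"])
          .unstack()
          .droplevel(0, axis=1)
          .reset_index())
print(df1)
print(df2)
```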