Although I can create a pivot table in Excel, I would rather use Python to combine rows that share similar fields. In this case, a row should only be combined with rows that have the same "id" and "location".
Input:
id location date code
111 Park 1/1/2018 7765
143 School 2/5/2018 3345
111 Beach 1/1/2018 7534
223 Library 3/5/2018 3345
Output 1:
id location date code
111 Park, Beach 1/1/2018 7765, 7534
143 School 2/5/2018 3345
223 Library 3/5/2018 3345
Output 2:
id location1 location2 date code1 code2
111 Park Beach 1/1/2018 7765 7534
143 School 2/5/2018 3345
223 Library 3/5/2018 3345
The only reason I want to see queries for both outputs is that I have several other columns containing the definitions of these codes. I know I should use groupby on id and location; however, I am struggling with producing output 1 and output 2 and with creating the new rows.
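For reference, the sample input above can be rebuilt as a pandas DataFrame named `df`, which is what the answers below operate on (a minimal setup, assuming pandas):

```python
import pandas as pd

# Rebuild the sample input from the question.
df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
print(df)
```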
Answer 0 (score: 1)
IIUC (if I understand correctly):
mapper = lambda x : ",".join(x)
df["code"] = df["code"].astype(str)
df.groupby("id").agg({"location" : mapper, "code" : mapper})
location code
id
111 Park,Beach 7765,7534
143 School 3345
223 Library 3345
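Note that this aggregation drops the date column. If you also want to keep one date per group, as in the question's expected output, one option (a sketch, using the same sample data) is to add `"date": "first"` to the same agg call:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
df["code"] = df["code"].astype(str)

# Same groupby as above, but also keep the first date seen in each group.
out = df.groupby("id", as_index=False).agg(
    {"location": ",".join, "code": ",".join, "date": "first"}
)
print(out)
```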
Answer 1 (score: 1)
Try this:
df['date'] = pd.to_datetime(df['date'])
df['code'] = df['code'].astype(str)
df = df.groupby(by=['id', 'date'], as_index=False).agg({'location': ','.join, 'code': ','.join})
print(df)
id date location code
0 111 2018-01-01 Park,Beach 7765,7534
1 143 2018-02-05 School 3345
2 223 2018-03-05 Library 3345
Answer 2 (score: 1)
You can try the following approach:
Join the necessary columns in the dataframe:
df["code"] = df["code"].astype(str)
output1 = df.groupby("id").agg({"location": ",".join,"code":",".join,"date":'first'}).reset_index()
With this approach, if person A went to School twice on the same day, the output keeps the unique value School rather than showing School, School. On the other hand, if person A went to School twice on the same day but with two different codes, it produces School,School:
df["code"] = df["code"].astype(str)
output1 = df.groupby(["id", "date"]).agg({"location": list, "code": list}).reset_index()
## rows where location and code have the same number of unique values;
## a `set` operation is enough here to keep only the unique elements
unique_values = output1[output1["location"].apply(set).apply(len) == output1["code"].apply(set).apply(len)].copy()
## rows where location and code have different numbers of unique values
## (e.g. the same location with two different codes); no `set` operation needed
other_values = output1[output1["location"].apply(set).apply(len) != output1["code"].apply(set).apply(len)].copy()
## collapse the lists/sets into comma-separated strings
## (note: joining a set does not guarantee element order)
unique_values["location"] = unique_values["location"].apply(set).apply(",".join)
unique_values["code"] = unique_values["code"].apply(set).apply(",".join)
other_values["location"] = other_values["location"].apply(",".join)
other_values["code"] = other_values["code"].apply(",".join)
## concatenate both dataframes back together
output1 = pd.concat([unique_values, other_values]).sort_index()
This produces the output1 dataframe. The following code expands the location and code columns of that dataframe into separate numbered columns:
output2 = output1["location"].str.split(pat=",",expand=True)
output2.columns = ["location_"+ str(i) for i in output2.columns]
output3 = output1["code"].str.split(pat=",",expand=True)
output3.columns = ["code"+ str(i) for i in output3.columns]
final_output = pd.concat([output1, output2, output3],axis=1)
final_output = final_output.fillna('')
This gives the final expanded output.
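The split/expand step can be checked end to end on the question's sample data. A minimal, self-contained version (the intermediate unique-value handling is omitted here, since the sample data has no duplicate visits):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})
df["code"] = df["code"].astype(str)

# Join locations/codes per (id, date); row order within each group is preserved.
output1 = df.groupby(["id", "date"], as_index=False).agg(
    {"location": ",".join, "code": ",".join}
)

# Expand the comma-separated columns into numbered columns.
loc = output1["location"].str.split(",", expand=True)
loc.columns = ["location_" + str(i) for i in loc.columns]
codes = output1["code"].str.split(",", expand=True)
codes.columns = ["code" + str(i) for i in codes.columns]

final_output = pd.concat([output1, loc, codes], axis=1).fillna("")
print(final_output)
```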
Answer 3 (score: 1)
For case 1, use DataFrame.groupby on id and date, then aggregate the location and code columns with ', '.join:
df1 = df.astype({'code': 'str'}).groupby(['id', 'date']).agg(', '.join).reset_index()
For case 2, use DataFrame.melt, then groupby on id and variable and use cumcount to append a sequential counter to the variable column; finally use set_index, unstack and droplevel:
df2 = df.melt(id_vars=['id', 'date'])
df2['variable'] += df2.groupby(['id', 'variable']).cumcount().add(1).astype(str)
df2 = df2.set_index(['id', 'date', 'variable']).unstack().droplevel(0, 1).reset_index()
Result:
# CASE 1: print(df1)
id date location code
0 111 1/1/2018 Park, Beach 7765, 7534
1 143 2/5/2018 School 3345
2 223 3/5/2018 Library 3345
# CASE 2: print(df2)
variable id date code1 code2 location1 location2
0 111 1/1/2018 7765 7534 Park Beach
1 143 2/5/2018 3345 NaN School NaN
2 223 3/5/2018 3345 NaN Library NaN
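Both cases from this answer can be run end to end on the sample data. A self-contained sketch (code is cast to string up front in both paths, and `axis=1` is written out explicitly in `droplevel`):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 143, 111, 223],
    "location": ["Park", "School", "Beach", "Library"],
    "date": ["1/1/2018", "2/5/2018", "1/1/2018", "3/5/2018"],
    "code": [7765, 3345, 7534, 3345],
})

# Case 1: join locations/codes per (id, date).
df1 = (df.astype({"code": "str"})
         .groupby(["id", "date"])
         .agg(", ".join)
         .reset_index())

# Case 2: melt to long form, number repeated variables
# (location1, location2, code1, ...), then pivot back to wide form.
df2 = df.astype({"code": "str"}).melt(id_vars=["id", "date"])
df2["variable"] += df2.groupby(["id", "variable"]).cumcount().add(1).astype(str)
df2 = (df2.set_index(["id", "date", "variable"])
          .unstack()
          .droplevel(0, axis=1)
          .reset_index())
print(df1)
print(df2)
```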