Given a DataFrame (https://pastebin.com/MdqWz4Ke):
import numpy as np
import pandas as pd

# some data
data3 = [["Alex","Tampa","A23","1","Ax","Red"],
["Alex","Tampa","A23","1","Ay","Blue"],
["Alex","Tampa","B43","1","Bx","Green"],
["Alex","Tampa","B43","1","By","White"],
["Alex","Tampa","C55","1","Cx","Red"],
["Alex","Tampa","C55","1","Cy","White"],
["Alex","Tampa","C55","2","Cx","Purple"],
["Alex","Tampa","C55","2","Cy","Black"],
["Tim","San Diego","A23","1","Ax","Green"],
["Tim","San Diego","A23","1","Ay","Black"],
["Tim","San Diego","B43","1","Bx","Yellow"],
["Tim","San Diego","B43","1","By","Black"],
["Tim","San Diego","C55","1","Cx","Pink"],
["Tim","San Diego","C55","1","Cy","Orange"],
["Tim","San Diego","A23","2","Ax","Green"],
["Tim","San Diego","A23","2","Ay","Red"],
["Tim","San Diego","B43","2","Bx",""],
["Tim","San Diego","B43","2","By",""],
["Mark","Houston","A23","1","Ax","Purple"],
["Mark","Houston","A23","1","Ay","Yellow"],
["Mark","Houston","B43","1","Bx","Gray"],
["Mark","Houston","B43","1","By","White"],
["Mark","Houston","C55","1","Cx",""],
["Mark","Houston","C55","1","Cy",""],
["Anthony","Seattle","A23","","Ax","Orange"],
["Anthony","Seattle","A23","","Ay","Black"],
["Anthony","Seattle","B43","","Bx","Red"],
["Anthony","Seattle","B43","","By","Black"],
["Anthony","Seattle","C55","","Cx","Blue"],
["Anthony","Seattle","C55","","Cy","Pink"]]
# create dataframe
df3 = pd.DataFrame(data3,columns=[
"Name","City","Domain","Sequence","Group","Value"])
How can I compare values within groups and conditionally fill columns with those values?
# add Compared columns
df3["Compared Group"] = ""
df3["Compared Value"] = ""
# replace empty or whitespace-only strings with np.nan
df3.replace(r"^\s*$", np.nan, regex=True, inplace=True)
# fillna for missing Sequence and Value
df3.fillna({"Sequence":"N/A","Value":"NULL"},inplace=True)
# expected result
result = [["Alex","Tampa","A23","1","Ax","Red","Ay","Blue"],
["Alex","Tampa","B43","1","Bx","Green","By","White"],
["Alex","Tampa","C55","1","Cx","Red","Cy","White"],
["Alex","Tampa","C55","2","Cx","Purple","Cy","Black"],
["Tim","San Diego","A23","1","Ax","Green","Ay","Black"],
["Tim","San Diego","A23","2","Ax","Green","Ay","Red"],
["Tim","San Diego","B43","1","Bx","Yellow","By","Black"],
["Tim","San Diego","B43","2","Bx","NULL","By","NULL"],
["Tim","San Diego","C55","1","Cx","Pink","Cy","Orange"],
["Mark","Houston","A23","1","Ax","Purple","Ay","Yellow"],
["Mark","Houston","B43","1","Bx","Gray","By","White"],
["Mark","Houston","C55","1","Cx","NULL","Cy","NULL"],
["Anthony","Seattle","A23","","Ax","Orange","Ay","Black"],
["Anthony","Seattle","B43","","Bx","Red","By","Black"],
["Anthony","Seattle","C55","","Cx","Blue","Cy","Pink"]]
result_df = pd.DataFrame(result,columns=[
"Name","City","Domain","Sequence","Group",
"Value","Compared Group","Compared Value"])
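Notes:
If a person's Group value pairs with another (Ax with Ay, Bx with By, and so on) and the Sequence number is the same, the Compared Group and Compared Value columns should be filled with the paired row's Group and Value.
City and Domain are not considered in the comparison, but all columns need to be retained.
Some rows have no Sequence number, so I filled those with N/A so that they can still be grouped. In addition, some rows have no value in the Value column, so I filled them with NULL to avoid problems when populating the Compared Value column.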
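As a side illustration of why the N/A placeholder matters for grouping: by default, pandas leaves rows out of a groupby when a grouping key is NaN. The sketch below uses a small hypothetical frame (`toy`, not the data above) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical, not the question's data): two rows lack a Sequence.
toy = pd.DataFrame({
    "Name": ["Anthony", "Anthony", "Alex"],
    "Sequence": [np.nan, np.nan, "1"],
    "Value": ["Orange", "Black", "Red"],
})

# By default, rows whose grouping key is NaN are left out of the groups.
default_sizes = toy.groupby(["Name", "Sequence"]).size()

# Filling the key with a placeholder keeps those rows in a group of their own.
filled_sizes = (
    toy.fillna({"Sequence": "N/A"})
       .groupby(["Name", "Sequence"])
       .size()
)

print(default_sizes)
print(filled_sizes)
```

In pandas 1.1 and later, passing `dropna=False` to `groupby` is another way to keep such rows without a placeholder.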
I have created a dictionary that maps the Group values:
# map groups with dictionary
group_dict = {"Ax":"Ay","Bx":"By","Cx":"Cy"}
and created a groupby object:
# groupby
grouped = df3.groupby(["Name","Sequence","Domain","Group"], group_keys=False)
My initial plan was to use .loc to populate the Compared columns, possibly using map with the dictionary, but when I try to access the values in a group...
for name in df3["Name"]:
    print(grouped.get_group((name,"Ax")))
I get the following error:
ValueError: must supply a a same-length tuple to get_group with multiple grouping keys
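I think this is because not all groups contain the same number and type of Group values (for example, Tim has Ax for Sequence 1 and 2, whereas Alex has Ax for only one Sequence). I am not sure how to proceed from here to merge and transform these rows.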
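One thing worth checking about the error itself: the groupby was built on four keys (Name, Sequence, Domain, Group), while the tuple passed to get_group has only two elements. The sketch below, on a hypothetical toy frame rather than the full data, shows a lookup with a tuple whose length matches the number of keys, and that a shorter tuple is rejected.

```python
import pandas as pd

# Toy frame (hypothetical subset of the question's columns).
toy = pd.DataFrame({
    "Name": ["Alex", "Alex", "Tim", "Tim"],
    "Sequence": ["1", "1", "1", "2"],
    "Domain": ["A23", "A23", "A23", "A23"],
    "Group": ["Ax", "Ay", "Ax", "Ax"],
    "Value": ["Red", "Blue", "Green", "Green"],
})

grouped = toy.groupby(["Name", "Sequence", "Domain", "Group"])

# Four grouping keys, so the lookup tuple needs four elements.
alex_ax = grouped.get_group(("Alex", "1", "A23", "Ax"))
print(alex_ax)

# A two-element tuple does not line up with the four keys and is rejected.
try:
    grouped.get_group(("Alex", "Ax"))
    short_tuple_ok = True
except (ValueError, KeyError):
    short_tuple_ok = False
print(short_tuple_ok)
```

If only Name and Group should identify a group, building the groupby on just those two columns would make a two-element tuple valid.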
Answer 0 (score: 1)
Given the example data, you can do the following:
def myfunc(x):
    # take rows 0, 2, 4, ... (the first member of each pair);
    # reset_index renumbers them 0, 1, 2, ...
    df1 = x.iloc[::2].reset_index(drop=True)
    # take rows 1, 3, 5, ... (the second member of each pair)
    df2 = x.iloc[1::2].reset_index(drop=True)
    # merge on the new index to put the two halves side by side
    return df1.merge(df2, left_index=True, right_index=True)

# group by the other columns and select only ['Group', 'Value']
(df3.groupby(['Name', 'City', 'Domain', 'Sequence'])[['Group', 'Value']]
    .apply(myfunc)               # pair up the rows within each group
    .reset_index(-1, drop=True)  # drop the unnecessary innermost index level
    .reset_index()               # turn the grouping columns back into data
)
Output:
       Name       City  Domain  Sequence  Group_x  Value_x  Group_y  Value_y
 0     Alex      Tampa     A23         1       Ax      Red       Ay     Blue
 1     Alex      Tampa     B43         1       Bx    Green       By    White
 2     Alex      Tampa     C55         1       Cx      Red       Cy    White
 3     Alex      Tampa     C55         2       Cx   Purple       Cy    Black
 4  Anthony    Seattle     A23       N/A       Ax   Orange       Ay    Black
 5  Anthony    Seattle     B43       N/A       Bx      Red       By    Black
 6  Anthony    Seattle     C55       N/A       Cx     Blue       Cy     Pink
 7     Mark    Houston     A23         1       Ax   Purple       Ay   Yellow
 8     Mark    Houston     B43         1       Bx     Gray       By    White
 9     Mark    Houston     C55         1       Cx     NULL       Cy     NULL
10      Tim  San Diego     A23         1       Ax    Green       Ay    Black
11      Tim  San Diego     A23         2       Ax    Green       Ay      Red
12      Tim  San Diego     B43         1       Bx   Yellow       By    Black
13      Tim  San Diego     B43         2       Bx     NULL       By     NULL
14      Tim  San Diego     C55         1       Cx     Pink       Cy   Orange
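The approach above relies on the two rows of each pair alternating within a group. If that ordering cannot be guaranteed, one possible alternative is a self-merge driven by the group_dict mapping from the question. This is only a sketch on a hypothetical toy frame (`toy`, with rows deliberately out of order); the names `left`, `right`, and `paired` are illustrative.

```python
import pandas as pd

# Toy frame (hypothetical subset); rows are deliberately not alternating.
toy = pd.DataFrame(
    [["Alex", "Tampa", "A23", "1", "Ay", "Blue"],
     ["Alex", "Tampa", "B43", "1", "Bx", "Green"],
     ["Alex", "Tampa", "A23", "1", "Ax", "Red"],
     ["Alex", "Tampa", "B43", "1", "By", "White"]],
    columns=["Name", "City", "Domain", "Sequence", "Group", "Value"],
)
group_dict = {"Ax": "Ay", "Bx": "By"}

# Left side: rows whose Group is a dictionary key; record the partner to find.
left = toy[toy["Group"].isin(list(group_dict))].copy()
left["Compared Group"] = left["Group"].map(group_dict)

# Right side: the partner rows, renamed to line up with the lookup column.
right = toy[toy["Group"].isin(list(group_dict.values()))]
right = right[["Name", "Sequence", "Group", "Value"]].rename(
    columns={"Group": "Compared Group", "Value": "Compared Value"}
)

# Match on person, sequence, and mapped group; City and Domain ride along.
paired = left.merge(right, on=["Name", "Sequence", "Compared Group"], how="left")
print(paired)
```

Using `how="left"` keeps a row even when its partner is missing, in which case Compared Value comes out as NaN.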
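Answer 1 (score: 0)
On your sample data, you can try set_index and unstack with a custom group ID built from Group. Then tidy up the column names and call reset_index to turn the index back into columns.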
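# position of each row within its (Name, Sequence, Domain) group: 0, 1, ...
s = df3.groupby(["Name", "Sequence", "Domain"]).Group.cumcount()
# move that position into the index, then pivot it into the columns
df_out = (df3.set_index(["Name", "City", "Sequence", "Domain", s])
             .unstack()
             .sort_index(level=1, axis=1))
# relabel position 0 as '' and position 1 as 'Compared ', then flatten the names
df_out.columns = (df_out.columns.set_levels(['', 'Compared '], level=1)
                                .map('{0[1]}{0[0]}'.format))
df_out.reset_index()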
Out[297]:
       Name       City  Sequence  Domain  Group   Value  Compared Group  Compared Value
 0     Alex      Tampa         1     A23     Ax     Red              Ay            Blue
 1     Alex      Tampa         1     B43     Bx   Green              By           White
 2     Alex      Tampa         1     C55     Cx     Red              Cy           White
 3     Alex      Tampa         2     C55     Cx  Purple              Cy           Black
 4  Anthony    Seattle       N/A     A23     Ax  Orange              Ay           Black
 5  Anthony    Seattle       N/A     B43     Bx     Red              By           Black
 6  Anthony    Seattle       N/A     C55     Cx    Blue              Cy            Pink
 7     Mark    Houston         1     A23     Ax  Purple              Ay          Yellow
 8     Mark    Houston         1     B43     Bx    Gray              By           White
 9     Mark    Houston         1     C55     Cx    NULL              Cy            NULL
10      Tim  San Diego         1     A23     Ax   Green              Ay           Black
11      Tim  San Diego         1     B43     Bx  Yellow              By           Black
12      Tim  San Diego         1     C55     Cx    Pink              Cy          Orange
13      Tim  San Diego         2     A23     Ax   Green              Ay             Red
14      Tim  San Diego         2     B43     Bx    NULL              By            NULL
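To see the cumcount and unstack step in isolation, here is a minimal sketch on a hypothetical two-person frame (`toy`, `position`, and `wide` are illustrative names, not part of the answer). Each row gets its position within its Name group, and unstack then turns that position into a second column level.

```python
import pandas as pd

# Toy frame (hypothetical): two rows per person.
toy = pd.DataFrame({
    "Name": ["Alex", "Alex", "Tim", "Tim"],
    "Group": ["Ax", "Ay", "Ax", "Ay"],
    "Value": ["Red", "Blue", "Green", "Black"],
})

# Number the rows inside each Name: 0 for the first row, 1 for the second.
position = toy.groupby("Name").cumcount()

# Put that counter in the index, then pivot it into the columns.
wide = toy.set_index(["Name", position]).unstack()

print(wide)
```

The result has one row per Name and a two-level column index, (Group, 0), (Group, 1), (Value, 0), (Value, 1), which is what the sort_index and set_levels calls in the answer then reorder and rename.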