熊猫groupby并根据值填充列

时间:2019-09-11 16:37:23

标签: python-3.x pandas group-by

给出一个数据框(https://pastebin.com/MdqWz4Ke

# some data
data3 = [["Alex","Tampa","A23","1","Ax","Red"],
     ["Alex","Tampa","A23","1","Ay","Blue"],
     ["Alex","Tampa","B43","1","Bx","Green"],
     ["Alex","Tampa","B43","1","By","White"],
     ["Alex","Tampa","C55","1","Cx","Red"],
     ["Alex","Tampa","C55","1","Cy","White"],
     ["Alex","Tampa","C55","2","Cx","Purple"],
     ["Alex","Tampa","C55","2","Cy","Black"],
     ["Tim","San Diego","A23","1","Ax","Green"],
     ["Tim","San Diego","A23","1","Ay","Black"],
     ["Tim","San Diego","B43","1","Bx","Yellow"],
     ["Tim","San Diego","B43","1","By","Black"],
     ["Tim","San Diego","C55","1","Cx","Pink"],
     ["Tim","San Diego","C55","1","Cy","Orange"],
     ["Tim","San Diego","A23","2","Ax","Green"],
     ["Tim","San Diego","A23","2","Ay","Red"],
     ["Tim","San Diego","B43","2","Bx",""],
     ["Tim","San Diego","B43","2","By",""],        
     ["Mark","Houston","A23","1","Ax","Purple"],
     ["Mark","Houston","A23","1","Ay","Yellow"],
     ["Mark","Houston","B43","1","Bx","Gray"],
     ["Mark","Houston","B43","1","By","White"],
     ["Mark","Houston","C55","1","Cx",""],
     ["Mark","Houston","C55","1","Cy",""],        
     ["Anthony","Seattle","A23","","Ax","Orange"],
     ["Anthony","Seattle","A23","","Ay","Black"],
     ["Anthony","Seattle","B43","","Bx","Red"],
     ["Anthony","Seattle","B43","","By","Black"],
     ["Anthony","Seattle","C55","","Cx","Blue"],
     ["Anthony","Seattle","C55","","Cy","Pink"]]

# create dataframe
df3 = pd.DataFrame(data3,columns=[
        "Name","City","Domain","Sequence","Group","Value"])

如何比较组中的值并有条件地用这些值填充列?

# add Compared columns
df3["Compared Group"] = ""
df3["Compared Value"] = ""

# replace nulls with np.NaN
df3.replace(r"^s*$", np.nan, regex=True, inplace = True)

# fillna for missing Sequence and Value
df3.fillna({"Sequence":"N/A","Value":"NULL"},inplace=True)

# expected result
result = [["Alex","Tampa","A23","1","Ax","Red","Ay","Blue"],
          ["Alex","Tampa","B43","1","Bx","Green","By","White"],
          ["Alex","Tampa","C55","1","Cx","Red","Cy","White"],
          ["Alex","Tampa","C55","2","Cx","Purple","Cy","Black"],
         ["Tim","San Diego","A23","1","Ax","Green","Ay","Black"],
         ["Tim","San Diego","A23","2","Ax","Green","Ay","Red"],
         ["Tim","San Diego","B43","1","Bx","Yellow","By","Black"],
         ["Tim","San Diego","B43","2","Bx","NULL","By","NULL"],        
         ["Tim","San Diego","C55","1","Cx","Pink","Cy","Orange"],        
         ["Mark","Houston","A23","1","Ax","Purple","Ay","Yellow"],
         ["Mark","Houston","B43","1","Bx","Gray","By","White"],
         ["Mark","Houston","C55","1","Cx","NULL","Cy","NULL"],
         ["Anthony","Seattle","A23","","Ax","Orange","Ay","Black"],
         ["Anthony","Seattle","B43","","Bx","Red","By","Black"],
         ["Anthony","Seattle","C55","","Cx","Blue","Cy","Pink"]]

    result_df = pd.DataFrame(result,columns=[
            "Name","City","Domain","Sequence","Group",
            "Value","Compared Group","Compared Value"])

注意:

  • 如果某人的Group值与另一个(AxAy相匹配, BxBy,例如{}和Sequence号相同, Compared GroupCompared ValueGroupValue

  • 比较中不考虑CityDomain,但所有 列需要保留。

  • 某些行将没有Sequence数字,因此我用 N/A,以便对某些值进行分组。此外,某些行将没有 Value列中的值,所以我用NULL填充它们以填充 填充Compared Values列时会出现一些问题。

我已经创建了一个映射Group值的字典

# map groups with dictionary
group_dict = {"Ax":"Ay","Bx":"By","Cx":"Cy"}

并创建groupby对象

# groupby
grouped = df3.groupby(["Name","Sequence","Domain","Group"], group_keys=False)

我最初的计划是.loc,以便填充Compared列,并可能在字典中使用map,但是当尝试访问组中的值时... < / p>

for name in df3["Name"]:
    print(grouped.get_group((name,"Ax")))

我收到以下错误:

ValueError: must supply a a same-length tuple to get_group with multiple grouping keys

我认为这是因为并非所有组都包含相同数量和类型的Group值(例如,Tim为Sequence拥有Ax 1和2,而Alex只有{ {1}}代表Sequence)。我不确定如何从此处开始以合并和转换这些行。

2 个答案:

答案 0 :(得分:1)

给出示例数据,您可以执行以下操作:

def myfunc(x):
    # extract rows 0 2 4 ...
    # reset_index rename the rows as 0 1 2 ...
    df1 = x.iloc[::2].reset_index(drop=True)

    # extract rows 1 3 5
    df2 = x.iloc[1::2].reset_index(drop=True)

    # merge put the two dataframes next together
    return df1.merge(df2, left_index=True, right_index=True)

# group by the other columns and select only ['Group', 'Value']
(df3.groupby(['Name', 'City', 'Domain', 'Sequence'])[['Group','Value']]
 .apply(myfunc)                # concatenate the rows
 .reset_index(-1, drop=True)   # drop the unnecessary index
 .reset_index()                # make the other original columns as data instead of index
)

输出:

       Name         City Domain Sequence Group_x Value_x Group_y Value_y
0      Alex        Tampa    A23        1      Ax     Red      Ay    Blue
1      Alex        Tampa    B43        1      Bx   Green      By   White
2      Alex        Tampa    C55        1      Cx     Red      Cy   White
3      Alex        Tampa    C55        2      Cx  Purple      Cy   Black
4   Anthony      Seattle    A23      N/A      Ax  Orange      Ay   Black
5   Anthony      Seattle    B43      N/A      Bx     Red      By   Black
6   Anthony      Seattle    C55      N/A      Cx    Blue      Cy    Pink
7      Mark      Houston    A23        1      Ax  Purple      Ay  Yellow
8      Mark      Houston    B43        1      Bx    Gray      By   White
9      Mark      Houston    C55        1      Cx    NULL      Cy    NULL
10      Tim  Los Angeles    A23        1      Ax   Green      Ay   Black
11      Tim  Los Angeles    A23        2      Ax   Green      Ay     Red
12      Tim  Los Angeles    B43        1      Bx  Yellow      By   Black
13      Tim  Los Angeles    B43        2      Bx    NULL      By    NULL
14      Tim  Los Angeles    C55        1      Cx    Pink      Cy  Orange

答案 1 :(得分:0)

根据您的示例,您可以通过set_indexGroup的自定义组ID尝试unstack。接下来,使用漂亮的列名称并返回reset_index

s = df3.groupby(["Name","Sequence","Domain",]).Group.cumcount()
df_out = (df3.set_index(["Name", "City", "Sequence","Domain", s])
             .unstack()
             .sort_index(level=1, axis=1))
df_out.columns = (df_out.columns.set_levels(['', 'Compared '], level=1)
                                .map('{0[1]}{0[0]}'.format))
df_out.reset_index()

Out[297]:
       Name         City Sequence Domain Group   Value Compared Group  \
0      Alex        Tampa        1    A23    Ax     Red             Ay
1      Alex        Tampa        1    B43    Bx   Green             By
2      Alex        Tampa        1    C55    Cx     Red             Cy
3      Alex        Tampa        2    C55    Cx  Purple             Cy
4   Anthony      Seattle      N/A    A23    Ax  Orange             Ay
5   Anthony      Seattle      N/A    B43    Bx     Red             By
6   Anthony      Seattle      N/A    C55    Cx    Blue             Cy
7      Mark      Houston        1    A23    Ax  Purple             Ay
8      Mark      Houston        1    B43    Bx    Gray             By
9      Mark      Houston        1    C55    Cx    NULL             Cy
10      Tim  Los Angeles        1    A23    Ax   Green             Ay
11      Tim  Los Angeles        1    B43    Bx  Yellow             By
12      Tim  Los Angeles        1    C55    Cx    Pink             Cy
13      Tim  Los Angeles        2    A23    Ax   Green             Ay
14      Tim  Los Angeles        2    B43    Bx    NULL             By

   Compared Value
0            Blue
1           White
2           White
3           Black
4           Black
5           Black
6            Pink
7          Yellow
8           White
9            NULL
10          Black
11          Black
12         Orange
13            Red
14           NULL