Splitting into separate rows based on a condition

Time: 2019-12-19 09:33:41

Tags: python pandas grouping

I have a DataFrame like the following:

import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

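My goal

If the values of the date columns are the same within a row, sum the x and y values. If they are not the same, split the row into two rows and keep their values unchanged.

Explained in (pseudo)code: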

for index, row in df.iterrows():
    # Collect the values of all columns whose name contains 'Date'
    dates = [row[name] for name in df.columns if 'Date' in name]
    date1, date2 = dates[0], dates[1]

    # Compare the values of the dates. See if they are equal
    if date1 == date2:
        # Sum the values of x1, x2. And sum the values of y1, y2
        pass
    else:
        # Group by date. Create two separate rows and do not sum the values of x and y.
        pass

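Another challenge is that there may be fewer or more than 2 columns containing a date. However, the column names will always contain the string "Date". For example, if there are three date columns with three different values, the goal is to create three rows. If there is only 1 date column, no modification is needed.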
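For reference, the loop described above can be written out literally. This is only a row-by-row sketch, not the approach used in the answer below. It assumes every date column is named `Date<suffix>` and has matching `x<suffix>` and `y<suffix>` columns; `split_rows` is a hypothetical helper name, and rows with a missing date are skipped.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
for col in ['Date1', 'Date2']:
    # errors='coerce' turns anything unparsable (here the string 'nan') into NaT
    df[col] = pd.to_datetime(df[col], errors='coerce')


def split_rows(frame):
    """Hypothetical helper: one output row per distinct date within each input row."""
    date_cols = [c for c in frame.columns if 'Date' in c]
    records = []
    for _, row in frame.iterrows():
        totals = {}  # date -> (sum of x, sum of y) for this input row
        for col in date_cols:
            suffix = col[len('Date'):]  # assumed naming: Date1 -> x1, y1
            date = row[col]
            if pd.isna(date):
                continue  # skip missing dates
            x, y = totals.get(date, (0, 0))
            totals[date] = (x + row['x' + suffix], y + row['y' + suffix])
        for date, (x, y) in totals.items():
            records.append({'ID': row['ID'], 'Date': date, 'x': x, 'y': y})
    return pd.DataFrame(records, columns=['ID', 'Date', 'x', 'y'])


out = split_rows(df)
print(out)
```

Because it collects sums per distinct date, the same function works for one, two, or more date columns, at the cost of a Python-level loop over the rows.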

Desired outcome

desired_outcome = [[101, '1987-09-01', 3, 3], [102, '1987-09-01', 1, 1], [102, '1999-09-01', 2, 2], [103, '1999-09-01', 2, 2]]
df_desired_outcome = pd.DataFrame(desired_outcome, columns=['ID', 'Date', 'x', 'y'])

1 Answer:

Answer 0: (score: 2)

First reshape with wide_to_long, then aggregate with sum:

df1 = pd.wide_to_long(df.reset_index(), 
                     stubnames=['Date','x','y'], 
                     i=['index','ID'], 
                     j='tmp')

df1 = df1.groupby(['index','ID','Date']).sum().reset_index(level=0, drop=True).reset_index()
print (df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0
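The last row of that output (103, nan) is not in the desired outcome. It appears because the date columns in this run seem to have been left as strings, so 'nan' is just another group key. A minimal sketch, assuming the dates are converted to datetime first as in the question: groupby drops NaT keys by default, so the extra row disappears. The `errors='coerce'` argument is my addition, to make the conversion of 'nan' to NaT explicit.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'x1', 'y1', 'Date2', 'x2', 'y2'])
# Convert the date columns; unparsable values such as 'nan' become NaT
df['Date1'] = pd.to_datetime(df['Date1'], errors='coerce')
df['Date2'] = pd.to_datetime(df['Date2'], errors='coerce')

# Reshape: Date1/Date2 -> Date, x1/x2 -> x, y1/y2 -> y, one row per numeric suffix
df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=['Date', 'x', 'y'],
                      i=['index', 'ID'],
                      j='tmp')

# Group by original row, ID and date; rows whose Date is NaT are dropped here
df1 = (df1.groupby(['index', 'ID', 'Date'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print(df1)
```

With datetime columns the result has four rows, one per (ID, Date) pair listed in the desired outcome.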

If the ID values are unique, the solution can be simplified:

df1 = pd.wide_to_long(df, 
                     stubnames=['Date','x','y'], 
                     i='ID', 
                     j='tmp')

df1 = df1.groupby(['ID','Date']).sum().reset_index()
print (df1)
    ID        Date  x  y
0  101  1987-09-01  3  3
1  102  1987-09-01  1  1
2  102  1999-09-01  2  2
3  103  1999-09-01  2  2
4  103         nan  0  0

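Edit:

If the column names do not end in 1, 2 the way the date columns do, they can be normalized by their first 2 letters, and then the solution above can be applied (with the stub names changed):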

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 
                                 'Date2', 'OPyy', 'NPyx'])

s = df.columns.to_series()
m = s.str.startswith(('ID','Date'))
s1 = s[~m].str[:2]
s2 = s1.groupby(s1).cumcount().add(1).astype(str)

s[~m] = s1 + s2
print (s)
ID          ID
Date1    Date1
OPxx       OP1
NPxy       NP1
Date2    Date2
OPyy       OP2
NPyx       NP2
dtype: object

df = df.rename(columns=s)
print (df)
    ID       Date1  OP1  NP1       Date2  OP2  NP2
0  101  1987-09-01    1    1  1987-09-01    2    2
1  102  1987-09-01    1    1  1999-09-01    2    2
2  103         nan    0    0  1999-09-01    2    2
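The edit stops after the rename. A sketch of the remaining step, assuming the stub names become ['Date', 'OP', 'NP'] and the ID values are unique, so the simplified variant applies. The dates are left as strings here, as in the snippet above, so the 'nan' row is kept as its own group.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy',
                                 'Date2', 'OPyy', 'NPyx'])

# Normalize the value columns to <first 2 letters><running number>
s = df.columns.to_series()
m = s.str.startswith(('ID', 'Date'))
s1 = s[~m].str[:2]
s2 = s1.groupby(s1).cumcount().add(1).astype(str)
s[~m] = s1 + s2
df = df.rename(columns=s.to_dict())

# Same reshape and aggregation as before, with the new stub names
df1 = pd.wide_to_long(df,
                      stubnames=['Date', 'OP', 'NP'],
                      i='ID',
                      j='tmp')
df1 = df1.groupby(['ID', 'Date']).sum().reset_index()
print(df1)
```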

EDIT2: I tried to create a more general solution:

import numpy as np

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2, 3],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2, 3],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2, 3]
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 'Date2',
                                 'OPyy', 'NPyx', 'WZ'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])


s = df.columns.to_series()

#get first 2 characters 
s1 = s.str[:2]
#create groups starting by ID and Da (first 2 letters of Date)
s2 = s1.isin(['ID','Da']).cumsum().astype(str)

s = s1 + s2
print (s)
ID       ID1
Date1    Da2
OPxx     OP2
NPxy     NP2
Date2    Da3
OPyy     OP3
NPyx     NP3
WZ       WZ3
dtype: object

df = df.rename(columns=s)
print (df)
   ID1        Da2  OP2  NP2        Da3  OP3  NP3  WZ3
0  101 1987-09-01    1    1 1987-09-01    2    2    3
1  102 1987-09-01    1    1 1999-09-01    2    2    3
2  103        NaT    0    0 1999-09-01    2    2    3

Then create the stub names dynamically: all unique values of s1, excluding ID and index

print(np.setdiff1d(s1.unique(), ['ID', 'index']))
['Da' 'NP' 'OP' 'WZ']

df1 = pd.wide_to_long(df.reset_index(), 
                     stubnames=np.setdiff1d(s1.unique(), ['ID', 'index']), 
                     i=['index','ID1'], 
                     j='tmp')

Then sum:

df2 = (df1.groupby(['index','ID1','Da'])
          .sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df2)
   ID1         Da  NP  OP   WZ
0  101 1987-09-01   3   3  3.0
1  102 1987-09-01   1   1  0.0
2  102 1999-09-01   2   2  3.0
3  103 1999-09-01   2   2  3.0
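One thing to note in that output: WZ is 0.0 for ID 102 on 1987-09-01. After the rename, WZ only exists with suffix 3, so the suffix-2 rows hold NaN for it, and sum() turns an all-NaN group into 0. If a missing value is preferred there (an assumption about what is wanted, not something the question states), passing min_count=1 to sum keeps it as NaN. A sketch of the whole pipeline with that change; the stub names are built with plain Python sets instead of np.setdiff1d, which should give the same list.

```python
import pandas as pd

data = [
    [101, '1987-09-01', 1, 1, '1987-09-01', 2, 2, 3],
    [102, '1987-09-01', 1, 1, '1999-09-01', 2, 2, 3],
    [103, 'nan', 0, 0, '1999-09-01', 2, 2, 3],
]
df = pd.DataFrame(data, columns=['ID', 'Date1', 'OPxx', 'NPxy', 'Date2',
                                 'OPyy', 'NPyx', 'WZ'])
df['Date1'] = pd.to_datetime(df['Date1'], errors='coerce')
df['Date2'] = pd.to_datetime(df['Date2'], errors='coerce')

s = df.columns.to_series()
s1 = s.str[:2]                                          # first 2 characters
s2 = s1.isin(['ID', 'Da']).cumsum().astype(str)         # new group at ID and each Date
df = df.rename(columns=(s1 + s2).to_dict())

# Stub names: all unique prefixes except the identifier columns
stubs = sorted(set(s1) - {'ID', 'index'})               # ['Da', 'NP', 'OP', 'WZ']

df1 = pd.wide_to_long(df.reset_index(),
                      stubnames=stubs,
                      i=['index', 'ID1'],
                      j='tmp')

# min_count=1: a group with no valid values yields NaN instead of 0
df2 = (df1.groupby(['index', 'ID1', 'Da'])
          .sum(min_count=1)
          .reset_index(level=0, drop=True)
          .reset_index())
print(df2)
```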