Cumulative count based on values in another column

Date: 2018-09-13 12:09:52

Tags: python pandas count

I'm trying to return a cumulative count based on values in other columns. For the df below, I want to build the count from Outcome together with the columns A, B, C, D. Specifically, whenever X or Y appears in Outcome, I want to record which of A, B, C, D most recently increased, so each X or Y row increments a counter for the column whose integer last went up. For example, at row 2 the Outcome is X and D is the column that increased most recently (from 1 to 2 at row 2), so D_X should go to 1; at row 5 the Outcome is X and C increased most recently (to 3 at row 5), so C_X should go to 1.

I tried to do this with the following:

import pandas as pd

d = ({
    'Outcome' : ['','','X','','','X','','Y','','Y'],
    'A' : [0,0,0,1,1,1,2,2,2,2],
    'B' : [0,0,0,1,1,1,1,1,2,2],
    'C' : [0,0,0,1,2,3,3,3,3,3],
    'D' : [0,1,2,2,2,2,2,2,2,2],                          
    })

df = pd.DataFrame(data = d)

m = pd.get_dummies(
    df.where(df.Outcome.ne(df.Outcome.shift()) & df.Outcome.str.len().astype(bool)),
    prefix='Count').cumsum()

df = pd.concat([
    m.where(m.ne(m.shift())).fillna('', downcast='infer'), df], axis=1)

But this isn't quite right.

My intended output is:

  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0
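
For reference, the expected output adds one count column per pair of a value column (A, B, C, D) and an Outcome value (X, Y). A quick sketch (not part of the question) of just those eight column names:

# Sketch only: the eight expected count-column names, one per
# (value column, Outcome value) pair.
new_cols = [f'{c}_{o}' for c in ['A', 'B', 'C', 'D'] for o in ['X', 'Y']]
print(new_cols)  # ['A_X', 'A_Y', 'B_X', 'B_Y', 'C_X', 'C_Y', 'D_X', 'D_Y']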

2 answers:

Answer 0: (score: 0)

The columns to test for integer increases and the "unique value" column are set as variables, so the routine can easily be adapted to input dataframes with other column names.

This routine is relatively fast even when the input dataframe is large, because it relies on fast numpy functions both inside the loop and throughout.
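
To make the approach concrete before the full routine, here is a minimal sketch (assuming the df from the question is already defined) of the detection step everything else builds on: diff() marks how much each target column grew on each row.

# Minimal sketch of the detection step (df comes from the question).
# Each row of `increases` holds the per-column change relative to the
# previous row; summing across a row shows how many columns changed at
# once with the unit-step data above.
increases = df[['A', 'B', 'C', 'D']].diff().fillna(0).astype(int)
print(increases.sum(axis=1).tolist())  # [0, 1, 1, 3, 1, 1, 1, 0, 1, 0]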

# this method assumes that only rows where exactly one column
# increases count as an increase in value.
# rows with more than one column increasing are ignored.
# it also assumes that integers always increase by one.

import pandas as pd
import numpy as np

# designate the integer increase columns
tgt_cols = ['A', 'B', 'C', 'D']
unique_val_col = 'Outcome'

# put None in empty string positions within array
# of Outcome column values
oc_vals = df[unique_val_col].where(df[unique_val_col] != '', None).values
# find the unique strings in Outcome
uniques = pd.unique(oc_vals[oc_vals != None])

# use pandas diff to locate integer increases in columns
diffs = df[tgt_cols].diff().fillna(0).values.astype(int)

# add the values in each diffs row (this will help later
# to find rows without any column increase or multiple
# increases)
row_sums = np.sum(diffs, axis=1)

# find the row indexes where a single integer increase
# occurred
change_row_idx = np.where(row_sums == 1)[0]

# find the indexes where a single increase did not occur
no_change_idx = np.where((row_sums == 0) | (row_sums > 1))[0]
# remove row 0 from the index if it exists because it is
# not applicable to previous changes
if no_change_idx[0] == 0:
    no_change_idx = no_change_idx[1:]

# locate the indexes of previous rows which had an integer
# increase to carry forward to rows without an integer increase
# (no_change_idx)
fwd_fill_index = \
    [np.searchsorted(change_row_idx, x) - 1 for x in no_change_idx if x > 0]

# write over no change row(s) with data from the last row with an
# integer increase.
# now each row in diffs will have a one marking the last or current change
diffs[no_change_idx] = diffs[change_row_idx][fwd_fill_index]

# make an array to hold the combined output result array
num_rows = diffs.shape[0]
num_cols = diffs.shape[1] * len(uniques)
result_array = np.zeros(num_rows * num_cols) \
    .reshape(diffs.shape[0], diffs.shape[1] * len(uniques)).astype(int)

# determine the pattern for combining the unique value arrays.
# (the example has alternating columns for X and Y results)
concat_pattern = np.array(range(len(tgt_cols) * len(uniques))) % len(uniques)

# loop through the unique values and do the following each time:
# make an array of zeros the same size as the diffs array.
# find the rows in the diffs array which are located one row above
# each unique value location in df.Outcome.
# put those rows into the array of zeros.
for i, u in enumerate(uniques):
    unique_val_ar = np.zeros_like(diffs)
    urows = np.where(oc_vals == u)[0]
    if urows[0] == 0:
        urows = urows[1:]
    # shift unique value index locations by -1
    adj_urows = urows - 1
    unique_val_ar[urows] = diffs[adj_urows]
    # put the columns from the unique_val_ar arrays
    # into the combined array according to the concat pattern
    # (tiled pattern per example)
    result_array[:, np.where(concat_pattern == i)[0]] = unique_val_ar

# find the cumulative sum of the combined array (vertical axis)
result_array_cumsums = np.cumsum(result_array, axis=0)

# make the column names for a new dataframe
# which will contain the result_array_cumsums array
tgt_vals = np.repeat(tgt_cols, len(uniques))
u_vals = np.tile(uniques, len(tgt_cols))
new_cols = ['_'.join(x) for x in list(zip(tgt_vals, u_vals))]

# make the dataframe, using the generated column names
df_results = pd.DataFrame(result_array_cumsums, columns=new_cols)

# join the result dataframe with the original dataframe
df_out = df.join(df_results)

print(df_out)

  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0
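
As the answer notes, only the two variables at the top need to change to reuse the routine on a dataframe with different column names. A hypothetical example (the names P, Q and Status are not from the question):

# Hypothetical adaptation: integer-increase columns named P and Q,
# outcome column named Status.
tgt_cols = ['P', 'Q']
unique_val_col = 'Status'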

Answer 1: (score: 0)

Below are 2 snippets:

  1. Following the description: it also captures the extra increase in column A between the first X and the second X
  2. Following the example: it captures only the most recent increase across all 4 columns

Snippet 2 reproduces the expected output from the question exactly; snippet 1 differs at row 5 (compare A_X, B_X and C_X in the two outputs).
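
Both snippets below iterate over consecutive pairs of "event" rows, i.e. row 0 plus every row where Outcome is X or Y. A minimal sketch of just that pairing, assuming the question's df:

# Sketch of the row pairing used by both snippets: event rows zipped
# with their successors give the intervals to compare.
event_idx = df[(df.Outcome == 'X') | (df.Outcome == 'Y') | (df.index == 0)].index
print([(int(a), int(b)) for a, b in zip(event_idx, event_idx[1:])])
# [(0, 2), (2, 5), (5, 7), (7, 9)]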

1) Following the description

# initialise the count columns
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

# walk consecutive pairs of "event" rows (row 0 plus every row where
# Outcome is X or Y)
for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1:]):
    for col in 'ABCD':
        # every column that increased between the two events bumps its
        # <col>_<Outcome> counter from row i2 onwards
        if df[col][i2] > df[col][i1]:
            df.loc[i2:, col+'_'+df.Outcome[i2]] = df[col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)

  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    1    0    1    0    1    0    1    0
6          2  1  3  2    1    0    1    0    1    0    1    0
7       Y  2  1  3  2    1    1    1    0    1    0    1    0
8          2  2  3  2    1    1    1    0    1    0    1    0
9       Y  2  2  3  2    1    1    1    1    1    0    1    0

2) Following the example

# initialise the count columns
for col in 'ABCD':
    df[col+'_X'] = 0
    df[col+'_Y'] = 0

# walk consecutive pairs of "event" rows (row 0 plus every row where
# Outcome is X or Y)
for i1, i2 in zip(df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index,
                  df[(df.Outcome=='X') | (df.Outcome=='Y') | (df.index==0)].index[1:]):
    change_col = ''
    change_pos = -1
    # of the columns that increased between the two events, keep only the
    # one whose increase happened last
    for col in 'ABCD':
        if df[col][i2] > df[col][i1]:
            # last row where this column still held its previous value
            found_change_pos = df[df[col] == df[col][i2] - 1].tail(1).index[0]
            if found_change_pos > change_pos:
                change_col = col
                change_pos = found_change_pos
    if change_pos > -1:
        df.loc[i2:, change_col+'_'+df.Outcome[i2]] = df[change_col+'_'+df.Outcome[i2]][i2-1] + 1
print(df)

  Outcome  A  B  C  D  A_X  A_Y  B_X  B_Y  C_X  C_Y  D_X  D_Y
0          0  0  0  0    0    0    0    0    0    0    0    0
1          0  0  0  1    0    0    0    0    0    0    0    0
2       X  0  0  0  2    0    0    0    0    0    0    1    0
3          1  1  1  2    0    0    0    0    0    0    1    0
4          1  1  2  2    0    0    0    0    0    0    1    0
5       X  1  1  3  2    0    0    0    0    1    0    1    0
6          2  1  3  2    0    0    0    0    1    0    1    0
7       Y  2  1  3  2    0    1    0    0    1    0    1    0
8          2  2  3  2    0    1    0    0    1    0    1    0
9       Y  2  2  3  2    0    1    0    1    1    0    1    0