I have data that looks like this (I have already set 'rule_id' as the index):
rule_id a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
After running this code:
import numpy as np

coeff = df.T
# compute the coefficients
for name, s in coeff.items():
    top = 100  # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:  # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2  # else half the previous value
        r.append(top)
    coeff.loc[:, name] = r  # set the whole column in one operation
# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
                  (df[col1] != 0) & (df[col2] != 0)]
    choices = [np.nan, 100, coeff[col1], df[col2] / df[col1] * coeff[col1] + coeff[col1]]
    df['comp{}'.format(i)] = np.select(conditions, choices)

old = df.columns[0]  # store name of first column
# Ok, enumerate all the columns (except first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col  # keep current column name for next iteration
# special processing for last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
my data looks like this:
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 100 NaN
52879 0 4 3 2 NaN 87.5 41.66 100
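So "df" here is the dataframe holding the data shown above. Look at the first row. According to my code, when two columns are compared and the first has a non-zero value (2) while the second has 0, the new column should be set to 100. When several consecutive non-zero values are compared (see row 2), the comparison goes like this:
9/12 * 50 + 50 = 87.5
and then
6/9 * 25 + 25 = 41.66
I was able to implement that, but the third comparison, between columns "c" and "d" with values 6 and 0, should be:
0/6 * 12.5 + 12.5 = 12.5
This is the part I am having trouble implementing. The value should be 12.5, not the 100 shown in comp3 of row 2. The same goes for the last row, whose values are 4, 3 and 2.
This is the result I want: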
rule_id a b c d comp1 comp2 comp3 comp4
50378 2 0 0 5 100 NaN NaN 100
50402 12 9 6 0 87.5 41.66 12.5 NaN
52879 0 4 3 2 NaN 87.5 41.66 12.5
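The arithmetic behind the values wanted for rule_id 50402 can be checked with a few lines of plain Python. This is only a sketch of the halving rule and the formula quoted above; the helper name `halving_coefficients` is made up for the illustration and is not part of the original code.

```python
def halving_coefficients(values, start=100.0):
    """Coefficient per value: reset to `start` on a 0, otherwise halve the previous one."""
    top = start
    coeffs = []
    for v in values:
        top = start if v == 0 else top / 2
        coeffs.append(top)
    return coeffs


row = [12, 9, 6, 0]                 # columns a..d for rule_id 50402
coeffs = halving_coefficients(row)  # one coefficient per column

# Compare each column with the next one: next / current * coeff + coeff
comps = [nxt / cur * c + c for cur, nxt, c in zip(row, row[1:], coeffs)]

print(coeffs)
print([round(x, 2) for x in comps])
```

The three comparisons come out as 87.5, roughly 41.67 and 12.5, which matches comp1 to comp3 in the wanted table for that row.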
Answer 0: (score: 2)
You said:
the third comparison, between columns "c" and "d" with values 6 and 0, should be:
0/6 *12.5 + 12.5 = 12.5
But your code says:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] != 0) & (df[col2] == 0), (df[col1] == df[col2]),
              (df[col1] != 0) & (df[col2] != 0)]
choices = [np.nan, 100, coeff[col1], df[col2]/df[col1]*coeff[col1]+coeff[col1]]
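Clearly (6, 0) satisfies condition[1], and so it produces 100. You seem to think it should satisfy condition[3], where both values are non-zero, but (6, 0) does not satisfy that condition, and even if it did it would not matter, because condition[1] matches first and np.select() picks the first match.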
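The first-match behaviour of np.select() is easy to see on a toy example. This is a minimal sketch with made-up arrays and placeholder choices (100.0 and -1.0), not the conditions from the question: the second condition overlaps the first, yet it only takes effect where the first one is false.

```python
import numpy as np

col1 = np.array([2, 6, 9])
col2 = np.array([0, 0, 6])

conditions = [
    (col1 != 0) & (col2 == 0),  # true for (2, 0) and (6, 0)
    col1 != 0,                  # true for all three pairs, but checked second
]
choices = [100.0, -1.0]         # placeholder values, one per condition

# For each element, np.select takes the choice of the first condition that is True
result = np.select(conditions, choices, default=np.nan)
print(result)
```

Only the last pair, (9, 6), receives the second choice, because the first condition already claimed the other two.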
Perhaps you want something like this:
conditions = [(df[col1] == 0) & (df[col2] == 0), (df[col1] == df[col2])]
choices = [np.nan , coeff[col1]]
default = df[col2]/df[col1]*coeff[col1]+coeff[col1]
df['comp{}'.format(i)] = np.select(conditions , choices, default)
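Answer 1: (score: 1)
Just chiming in with a contribution to your code: starting from the definition of the coeff matrix, the computation can be done directly on whole columns.
Initialization: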
>>> df = pd.DataFrame([[2, 0, 0, 5], [12, 9, 6, 0], [0, 4, 3, 2]],
... index=[50378, 50402, 52879],
... columns=['a', 'b', 'c', 'd'])
>>> df
a b c d
50378 2 0 0 5
50402 12 9 6 0
52879 0 4 3 2
Then compute the coefficients:
>>> # taking care of coefficients, using direct computation on columns
>>> coeff2 = pd.DataFrame(index=df.index, columns=df.columns)
>>> top = pd.Series([100.0]*len(df.index), index=df.index)  # floats, so halved values such as 12.5 fit
>>> for col_name, col in df.items():  # loop over columns (iteritems() was removed in pandas 2.0)
...     eq0 = (col == 0)              # boolean Series, identifying rows where content is 0
...     top[eq0] = 100                # where `eq0` is `True`, set 100...
...     top[~eq0] = top[~eq0] / 2     # ... and divide others by 2
...     coeff2[col_name] = top        # assign to output
...
>>> coeff2
Which gives:
a b c d
50378 50 100 100 50
50402 50 25 12.5 100
52879 100 50 25 12.5
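(As for the core of your question, John has already pointed out the condition that the function is missing, so there is no need for me to weigh in.)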