I have two DataFrames, shown below.
Teacher_Commission_df is as follows:
+---------+---------+----------+---------+
| Subject | Harare | Redcliff | Norton |
+---------+---------+----------+---------+
| Science | 0.100 | 0.125 | 0.145 |
+---------+---------+----------+---------+
| English | 0.125 | 0.150 | 0.170 |
+---------+---------+----------+---------+
| Maths | 0.090 | 0.115 | 0.135 |
+---------+---------+----------+---------+
| Music | 0.100 | 0.125 | 0.145 |
+---------+---------+----------+---------+
| Total | 0.415 | 0.515 | 0.595 |
+---------+---------+----------+---------+
Students_df is as follows. (Note that there are no Maths students in Harare or Norton.)
+---------+--------+----------+--------+
| Subject | Harare | Redcliff | Norton |
+---------+--------+----------+--------+
| Science | 15 | 18 | 20 |
+---------+--------+----------+--------+
| English | 35 | 33 | 31 |
+---------+--------+----------+--------+
| Maths | | 25 | |
+---------+--------+----------+--------+
| Music | 40 | 42 | 45 |
+---------+--------+----------+--------+
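I need to calculate the weighted-average commission for each city, subject to a condition.
First I will give the desired output, then explain the method.
The desired output is below.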
+------------+--------+----------+--------+
| Total_Paid | Harare | Redcliff | Norton |
+------------+--------+----------+--------+
| Science | 4.62 | 4.37 | 6.30 |
+------------+--------+----------+--------+
| English | 13.46 | 9.61 | 11.46 |
+------------+--------+----------+--------+
| Maths | 0.00 | 5.58 | 0.00 |
+------------+--------+----------+--------+
| Music | 12.31 | 10.19 | 14.18 |
+------------+--------+----------+--------+
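Calculation method
If, in any city column [Harare, Redcliff, Norton], the number of students for any subject [Science, English, Maths, Music] is zero, then that subject's Teacher_Commission should be removed from the weights.
For example, in Students_df, take the Science subject in the city column Harare. Since Maths is zero in Harare, Science is calculated as 15 * [0.10 / (0.415 - 0.09)] = 4.62. Note that 0.09, the Maths teacher_Commission, is removed from the total denominator. In Redcliff, where no subject is zero, it is calculated as 18 * [0.125 / 0.515] = 4.37.
I hope my explanation is clear.
This can be done easily in Microsoft Excel using IF conditions. However, I am looking for a scalable pandas solution.
I am not sure how to start the calculation, so please help me get started on this problem.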
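The rule above can be written as a single formula. The symbols are introduced here only for illustration and are not names from the data: $n_{s,c}$ is the number of students for subject $s$ in city $c$, and $r_{s,c}$ is the matching teacher commission.

```latex
\text{Total\_Paid}_{s,c}
  = n_{s,c} \cdot \frac{r_{s,c}}{\displaystyle\sum_{s' \,:\, n_{s',c} > 0} r_{s',c}}
```

For Science in Harare the sum in the denominator runs over Science, English and Music only, giving $0.100 + 0.125 + 0.100 = 0.325 = 0.415 - 0.09$, so the result is $15 \cdot 0.10 / 0.325 \approx 4.62$.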
Answer 0: (score: 1)
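So what you need is the row/column index of every null or empty value in the DataFrame?
You can use numpy.where(). Depending on the data type of the empty objects (your dtype), replace NaN with Null or "".
This is similar to what you did with IF in Excel.
Personally, I would just make a binary copy of the DataFrame, i.e. put 1 wherever the DataFrame has a non-null value and 0 wherever it is empty or zero, and then combine the two (for example, by element-wise multiplication). This may add some processing overhead, though.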
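A minimal sketch of the binary-mask idea, assuming the two tables from the question are typed in by hand (the Total row is left out). The names `teacher`, `students`, `mask`, `masked`, `weights` and `paid` are illustrative only and do not come from the question.

```python
import numpy as np
import pandas as pd

subjects = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
cities = ["Harare", "Redcliff", "Norton"]

# Commission rates from the question, without the Total row
teacher = pd.DataFrame(
    [[0.100, 0.125, 0.145],
     [0.125, 0.150, 0.170],
     [0.090, 0.115, 0.135],
     [0.100, 0.125, 0.145]],
    index=subjects, columns=cities)

# Student counts; blank cells in the question become NaN
students = pd.DataFrame(
    [[15, 18, 20],
     [35, 33, 31],
     [np.nan, 25, np.nan],
     [40, 42, 45]],
    index=subjects, columns=cities)

# Binary copy: 1 where a student count is present, 0 where it is missing
mask = students.notna().astype(int)

# Multiplying by the mask zeroes the commissions that must not count as weights
masked = teacher * mask

# Divide each column by its own (masked) total, then scale by student counts
weights = masked / masked.sum()
paid = (weights * students.fillna(0)).round(2)
print(paid)
```

The multiplication plays the role of the Excel IF: a commission only contributes to the column total when the matching student cell is filled in.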
Answer 1: (score: 1)
With pandas this is really just two lines of code:
import numpy as np
df_tmp = teacher_commission_df[~students_df.isnull()]
df = (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)
Result (using the new 3-digit-precision data):
In [1]: df
Out[1]:
Harare Redcliff Norton
Subject
Science 4.615385 4.368932 6.304348
English 13.461538 9.611650 11.456522
Maths 0.000000 5.582524 0.000000
Music 12.307692 10.194175 14.184783
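Note: the step-by-step explanation below uses the 2-digit-precision data given in the original question.
First, find the null entries in students_df: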
In [1]: students_df.isnull()
Out[1]:
Harare Redcliff Norton
Subject
Science False False False
English False False False
Maths True False True
Music False False False
Use ~ (negation) to select the non-null values from teacher_commission_df:
In [3]: teacher_commission_df[~students_df.isnull()]
Out[3]:
Harare Redcliff Norton
Subject
Science 0.10 0.13 0.15
English 0.13 0.15 0.17
Maths NaN 0.12 NaN
Music 0.10 0.13 0.15
Store the result in df_tmp and sum each column:
In [12]: df_tmp = teacher_commission_df[~students_df.isnull()]
In [14]: df_tmp.apply(np.nansum, axis=0)
Out[14]:
Harare 0.33
Redcliff 0.53
Norton 0.47
dtype: float64
Use DataFrame.div() to combine the sum with the division:
In [15]: df_tmp.div(df_tmp.apply(np.nansum, axis=0))
Out[15]:
Harare Redcliff Norton
Subject
Science 0.303030 0.245283 0.319149
English 0.393939 0.283019 0.361702
Maths NaN 0.226415 NaN
Music 0.303030 0.245283 0.319149
Multiply the resulting weights by students_df:
In [16]: df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df
Out[16]:
Harare Redcliff Norton
Subject
Science 4.545455 4.415094 6.382979
English 13.787879 9.339623 11.212766
Maths NaN 5.660377 NaN
Music 12.121212 10.301887 14.361702
Finally, replace the NaN values with 0:
In [17]: (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)
Out[17]:
Harare Redcliff Norton
Subject
Science 4.545455 4.415094 6.382979
English 13.787879 9.339623 11.212766
Maths 0.000000 5.660377 0.000000
Music 12.121212 10.301887 14.361702
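The steps above can be collected into one self-contained sketch. It is a variant rather than the exact code of this answer: it uses `DataFrame.where` and `DataFrame.sum` (which skips NaN by default) in place of boolean indexing and `np.nansum`, and it also treats a literal 0 as "no students", since the question speaks of zero counts. The data is typed in by hand from the question, and the Norton Maths cell is written as 0 on purpose to exercise that case. `has_students`, `weights` and `total_paid` are illustrative names.

```python
import numpy as np
import pandas as pd

subjects = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
cities = ["Harare", "Redcliff", "Norton"]

teacher_commission_df = pd.DataFrame(
    [[0.100, 0.125, 0.145],
     [0.125, 0.150, 0.170],
     [0.090, 0.115, 0.135],
     [0.100, 0.125, 0.145]],
    index=subjects, columns=cities)

students_df = pd.DataFrame(
    [[15, 18, 20],
     [35, 33, 31],
     [np.nan, 25, 0],   # assumption: one blank cell, one literal zero
     [40, 42, 45]],
    index=subjects, columns=cities)

# True only where a subject actually has students (neither NaN nor 0)
has_students = students_df.notna() & (students_df != 0)

# Keep commissions where there are students; everything else becomes NaN
df_tmp = teacher_commission_df.where(has_students)

# Per-city denominator: sum() ignores NaN, so excluded subjects drop out
weights = df_tmp.div(df_tmp.sum(axis=0), axis=1)

# Scale by student counts; excluded cells are NaN and are reported as 0
total_paid = (weights * students_df).fillna(0).round(2)
print(total_paid)
```

If the student table can only ever contain blanks, the `(students_df != 0)` part of the mask is unnecessary and `students_df.notna()` alone is enough.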
Answer 2: (score: 0)
Based on the suggestion from user aak, I managed to solve this entirely with numpy.
import numpy as np
import pandas as pd

# Load data; skipfooter=1 skips the last row of the teacher sheet (the Total row above)
Teacher_Commission_df = pd.read_excel('data_Teacher.xlsx', index_col='Subject', skipfooter=1)
Students_df = pd.read_excel('data_Studenst.xlsx', index_col='Subject')
# Treat missing student counts as zero
Students_df.fillna(value=0, inplace=True)
# Convert DataFrames to NumPy arrays
T = Teacher_Commission_df.to_numpy(dtype='float')
S = Students_df.to_numpy(dtype='float')
# Find the positions of zero student counts and
# zero out the corresponding values in the teacher array
T[np.where(S == 0)] = 0
# Column sums of the remaining commissions (the per-city denominator)
Total_Teacher = T.sum(axis=0)
# Calculate incentives
Calculations = T * (S / Total_Teacher)
incentives = (pd.DataFrame(Calculations, columns=Students_df.columns, index=Students_df.index)
              .round(decimals=2)
              .reset_index())
incentives
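For trying the numpy approach without the Excel files, here is a minimal sketch with the question's numbers typed in as arrays (rows are Science, English, Maths, Music; columns are Harare, Redcliff, Norton). It makes one addition that is not part of the answer above: `np.divide` with `out=` and `where=` guards against a city in which every subject has zero students, which would otherwise divide by zero. `T_masked`, `totals`, `weights` and `paid` are illustrative names.

```python
import numpy as np

# Commission rates, without the Total row
T = np.array([[0.100, 0.125, 0.145],
              [0.125, 0.150, 0.170],
              [0.090, 0.115, 0.135],
              [0.100, 0.125, 0.145]], dtype=float)

# Student counts, with missing values already filled with 0
S = np.array([[15, 18, 20],
              [35, 33, 31],
              [0, 25, 0],
              [40, 42, 45]], dtype=float)

# Zero out commissions wherever there are no students (T itself is not modified)
T_masked = np.where(S == 0, 0.0, T)

# Per-city denominator
totals = T_masked.sum(axis=0)

# Divide column-wise; columns whose total is 0 keep the 0.0 from `out`
weights = np.divide(T_masked, totals,
                    out=np.zeros_like(T_masked),
                    where=totals != 0)

paid = np.round(weights * S, 2)
print(paid)
```

With the data from the question no column total is zero, so the guard changes nothing here; it only matters if a whole city is empty.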