Question

I have a pandas DataFrame with the following data:

matchID    server    court    speed
1          1         A         100
1          2         D         200
1          3         D         300
1          4         A         100
1          1         A         120
1          2         A         250
1          3         D         110
1          4         D         100
2          1         A         100
2          2         D         200
2          3         D         300
2          4         A         100
2          1         A         120
2          2         A         250
2          3         D         110
2          4         D         100

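I want to add two new columns containing means based on two conditions. The column meanSpeedCourtA13 should contain the mean speed of servers 1 and 3 where court = A. That would be (100 + 120) / 2 = 110. The second column, meanSpeedCourtD13, should contain the mean speed of servers 1 and 3 where court = D. That would be (300 + 110) / 2 = 205.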

Note that this should be done for each matchID, so a groupby is also needed. This means that solutions involving iloc() cannot be used.

The resulting DataFrame should look like this:

matchID    server    court     speed    meanSpeedCourtA13   meanSpeedCourtD13
1          1         A         100      110                 205
1          2         D         200      110                 205
1          3         D         300      110                 205
1          4         A         100      110                 205
1          1         A         120      110                 205
1          2         A         250      110                 205
1          3         D         110      110                 205
1          4         D         100      110                 205
2          1         A         100      110                 205        
2          2         D         200      110                 205        
2          3         D         300      110                 205        
2          4         A         100      110                 205        
2          1         A         120      110                 205        
2          2         A         250      110                 205        
2          3         D         110      110                 205        
2          4         D         100      110                 205

Answer 1

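OK, this got a bit more complicated. Normally I would try something with transform, but I'd be happy if anyone has something better than the following: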

Use groupby and send each sub-DataFrame to a func that uses df.loc, then use pd.concat to glue the DataFrames back together:

import pandas as pd

data = {'matchID': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 
                    11: 2, 12: 2, 13: 2, 14: 2, 15: 2}, 
'court': {0: 'A', 1: 'D', 2: 'D', 3: 'A', 4: 'A', 5: 'A', 6: 'D', 7: 'D', 8: 'A',
          9: 'D', 10: 'D', 11: 'A', 12: 'A', 13: 'A', 14: 'D', 15: 'D'}, 
'speed': {0: 100, 1: 200, 2: 300, 3: 100, 4: 120, 5: 250, 6: 110, 7: 100, 8: 100, 
          9: 200, 10: 300, 11: 100, 12: 120, 13: 250, 14: 110, 15: 100}, 
'server': {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 3, 7: 4, 8: 1, 9: 2, 10: 3, 
           11: 4, 12: 1, 13: 2, 14: 3, 15: 4}}

df = pd.DataFrame(data)

def func(dfx):
    # Work on a copy so the assignment does not write into a groupby slice.
    dfx = dfx.copy()
    is_13 = dfx.server.isin((1, 3))
    dfx['meanSpeedCourtA13'] = dfx.loc[is_13 & (dfx.court == 'A'), 'speed'].mean()
    dfx['meanSpeedCourtD13'] = dfx.loc[is_13 & (dfx.court == 'D'), 'speed'].mean()
    return dfx

newdf = pd.concat(func(dfx) for _, dfx in df.groupby('matchID'))

print(newdf)

Returns:

   court  matchID  server  speed  meanSpeedCourtA13  meanSpeedCourtD13
0      A        1       1    100             110.00             205.00
1      D        1       2    200             110.00             205.00
2      D        1       3    300             110.00             205.00
3      A        1       4    100             110.00             205.00
4      A        1       1    120             110.00             205.00
5      A        1       2    250             110.00             205.00
6      D        1       3    110             110.00             205.00
7      D        1       4    100             110.00             205.00
8      A        2       1    100             110.00             205.00
9      D        2       2    200             110.00             205.00
10     D        2       3    300             110.00             205.00
11     A        2       4    100             110.00             205.00
12     A        2       1    120             110.00             205.00
13     A        2       2    250             110.00             205.00
14     D        2       3    110             110.00             205.00
15     D        2       4    100             110.00             205.00
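For comparison, here is a minimal sketch of the same result without the helper function or pd.concat. It assumes the question's data, rebuilt compactly here, and uses a per-match lookup table (the name means is only for illustration) that is mapped back onto every row:

```python
import pandas as pd

# Same data as in the question, rebuilt compactly (two identical matches).
df = pd.DataFrame({
    "matchID": [1] * 8 + [2] * 8,
    "server": [1, 2, 3, 4] * 4,
    "court": list("ADDAAADD") * 2,
    "speed": [100, 200, 300, 100, 120, 250, 110, 100] * 2,
})

# Keep only serves by servers 1 and 3, then average speed per (matchID, court).
means = (
    df[df["server"].isin([1, 3])]
    .groupby(["matchID", "court"])["speed"]
    .mean()
    .unstack("court")  # rows: matchID, columns: court values (A, D)
)

# Broadcast each match's mean back to every row of that match.
df["meanSpeedCourtA13"] = df["matchID"].map(means["A"])
df["meanSpeedCourtD13"] = df["matchID"].map(means["D"])

print(df)
```

The row order and index of df are left untouched, since nothing is split apart and glued back together.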

Answer 2

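You can take the groupby mean and assign the values via .item(), i.e. as below. Note that this groups only by court, not by matchID, so the means are computed over the whole DataFrame; it reproduces the expected output here only because both matches contain identical data.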

vals = df[df['server'].isin([1,3])].groupby(['court'])['speed'].mean().to_frame()


df['A13'],df['D13'] = vals.query("court=='A'")['speed'].item(), vals.query("court=='D'")['speed'].item()

    matchID  server court  speed    A13    D13
0         1       1     A    100  110.0  205.0
1         1       2     D    200  110.0  205.0
2         1       3     D    300  110.0  205.0
3         1       4     A    100  110.0  205.0
4         1       1     A    120  110.0  205.0
5         1       2     A    250  110.0  205.0
6         1       3     D    110  110.0  205.0
7         1       4     D    100  110.0  205.0
8         2       1     A    100  110.0  205.0
9         2       2     D    200  110.0  205.0
10        2       3     D    300  110.0  205.0
11        2       4     A    100  110.0  205.0
12        2       1     A    120  110.0  205.0
13        2       2     A    250  110.0  205.0
14        2       3     D    110  110.0  205.0
15        2       4     D    100  110.0  205.0
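If the means need to differ per match, one possible per-matchID variant is sketched below. It is not part of the original answer: it swaps in Series.where plus groupby(...).transform("mean"), and it uses a hypothetical variant of the data in which the speeds of match 2 are shifted by +10 so that the two matches give different results.

```python
import pandas as pd

# Hypothetical variant of the question's data: match 2 speeds are shifted
# by +10 so that the two matches have different means.
base_speed = [100, 200, 300, 100, 120, 250, 110, 100]
df = pd.DataFrame({
    "matchID": [1] * 8 + [2] * 8,
    "server": [1, 2, 3, 4] * 4,
    "court": list("ADDAAADD") * 2,
    "speed": base_speed + [s + 10 for s in base_speed],
})

# Replace the speeds of servers other than 1 and 3 with NaN.
speed_13 = df["speed"].where(df["server"].isin([1, 3]))

# For each court, mask out the other court, then take the per-match mean.
# mean skips NaN, and transform broadcasts the result to every row of the match.
for court in ["A", "D"]:
    df[f"{court}13"] = (
        speed_13.where(df["court"] == court)
        .groupby(df["matchID"])
        .transform("mean")
    )

print(df)
```

With this modified data, the rows of match 1 and match 2 get different values in A13 and D13, which the whole-frame version cannot produce.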

Answer 3

With groupby, we can still use loc to select the part we want to replace, but put the whole calculation inside a for loop over df.groupby("matchID").

for id, subg in df.groupby("matchID"):       
    df.loc[df.matchID==id, "meanSpeedCourtA13"] = (subg
              .where(subg.server.isin([1,3])).where(subg.court == "A").speed.mean())
    df.loc[df.matchID==id, "meanSpeedCourtD13"] = (subg
              .where(subg.server.isin([1,3])).where(subg.court == "D").speed.mean())

Thanks to @Dark for pointing out that I was hard-coding the groupby.

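As for loc, it selects values based on information from the two axes: rows and columns. By convention (see the documentation), the row selector comes first and the column selector second. For example, in df.loc[df.matchID==id, "meanSpeedCourtD13"], df.matchID==id selects the rows whose matchID is id, and "meanSpeedCourtD13" specifies the column to look at.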
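A small sketch of that row-first, column-second convention, using a made-up four-row frame and a hypothetical column name meanSpeed:

```python
import pandas as pd

# Tiny illustrative frame (values are made up for the example).
df = pd.DataFrame({"matchID": [1, 1, 2, 2], "speed": [100, 120, 300, 110]})

# Row selector first (a boolean mask), column selector second (a label).
# Assigning to a column that does not exist yet creates it.
df.loc[df["matchID"] == 1, "meanSpeed"] = 110.0

# Rows not selected by the mask are left as NaN in the new column.
print(df)
```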

A side note on calculating the mean:

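For each subgroup subg:
where(subg.server.isin([1,3])) filters out the servers that are not in [1,3] by replacing their rows with NaN.
where(subg.court == "A") further filters on court.
Finally, mean is called to calculate the mean speed; it skips the NaN values.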
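The steps above can be sketched on a single small subgroup. The five rows below are the first five rows of match 1 from the question; the intermediate names only_13 and only_13_on_a are introduced here for illustration:

```python
import pandas as pd

# First five rows of match 1 from the question.
subg = pd.DataFrame({
    "server": [1, 2, 3, 4, 1],
    "court": ["A", "D", "D", "A", "A"],
    "speed": [100, 200, 300, 100, 120],
})

# where keeps the shape of the frame and replaces non-matching rows with NaN.
only_13 = subg.where(subg["server"].isin([1, 3]))
only_13_on_a = only_13.where(subg["court"] == "A")

# mean ignores NaN by default, so only the surviving rows are averaged.
mean_a13 = only_13_on_a["speed"].mean()
print(mean_a13)
```

Because where never drops rows, the two filters can be chained in either order and the index stays aligned with subg throughout.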

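As an alternative, you can use np.where to assign values for each matchID in [1,2]. This only works when matchID is binary. In my tests on my machine it was about as fast as the groupby approach. To save space, we demonstrate with only the "meanSpeedCourtA13" column.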

import numpy as np

# First we calculate the means
# Calculate mean for Group with matchID being 1
meanSpeedCourtA13_ID1 = (df[df.matchID==1].
                 where(df.server.isin([1,3])).where(df.court == "A").speed.mean())    
# Calculate mean for Group with matchID being 2
meanSpeedCourtA13_ID2 = (df[df.matchID==2].
                 where(df.server.isin([1,3])).where(df.court == "A").speed.mean())
# Use np.where to allocate values to each matchID in [1, 2]
df["meanSpeedCourtA13"] = np.where(df.matchID == 1,
                                   meanSpeedCourtA13_ID1, meanSpeedCourtA13_ID2)

For np.where(condition, x, y), it returns x where the condition holds and y otherwise. See the np.where documentation for details.
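A minimal illustration of that element-wise behaviour, with made-up values standing in for the two precomputed means:

```python
import numpy as np
import pandas as pd

# Illustrative matchID column; 110.0 and 205.0 are placeholder values.
match_id = pd.Series([1, 1, 2, 2])

# Element-wise choice: 110.0 where matchID is 1, otherwise 205.0.
labels = np.where(match_id == 1, 110.0, 205.0)
print(labels)
```

The result is a NumPy array with one entry per row, which is why it can be assigned directly to a DataFrame column of the same length.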

Python pandas: mean by condition into new columns
