Aggregate function in R with three conditions

Time: 2016-08-07 23:07:21

Tags: r function aggregate countif

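I have event file data from retrosheet.org. This is baseball play-by-play data: each observation describes one play in one game of a baseball season (including reference variables for the game, the players, and the play).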

> str(e.2015.1990)
'data.frame':   4813807 obs. of  42 variables:
 $ GAME.ID                              : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ INNING                               : num  1 1 1 1 1 1 1 1 1 2 ...
 $ BATTING.TEAM                         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
 $ OUTS                                 : int  0 1 2 2 2 2 0 1 2 0 ...
 $ BATTER                               : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ BATTER.HAND                          : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ RES.BATTER                           : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
 $ RES.BATTER.HAND                      : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
 $ PITCHER                              : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ PITCHER.HAND                         : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ RES.PITCHER                          : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
 $ RES.PITCHER.HAND                     : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
 $ FIRST.RUNNER                         : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
 $ SECOND.RUNNER                        : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
 $ THIRD.RUNNER                         : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.TEXT                           : chr  "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
 $ EVENT.TYPE                           : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
 $ AB.FLAG                              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ HIT.VALUE                            : int  1 1 3 2 2 1 1 1 1 1 ...
 $ SH.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SF.FLAG                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ DOUBLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ TRIPLE.PLAY.FLAG                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RBI.ON.PLAY                          : num  0 0 0 1 0 0 0 0 0 0 ...
 $ BATTED.BALL.TYPE                     : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
 $ BATTER.DEST                          : int  0 0 2 1 1 0 0 0 0 0 ...
 $ RUNNER.ON.1ST.DEST                   : int  0 0 0 0 2 1 0 0 0 0 ...
 $ RUNNER.ON.2ND.DEST                   : int  0 0 0 4 0 2 0 0 0 0 ...
 $ RUNNER.ON.3RD.DEST                   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SB.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ SB.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ CS.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.1ST.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.2ND.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ PO.FOR.RUNNER.ON.3RD.FLAG            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
 $ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EVENT.NUM                            : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...

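From this, I want to compute each player's totals for each game. I want to build a data frame in which each observation describes one player's performance in one game of the season, so that every player in every game makes up one full observation.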

I created a new data frame with two columns, GAME.ID and PLAYER.ID, so that each STARTER in each game makes up one full observation.

> str(k.2015.1990)
'data.frame':   1146866 obs. of  2 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...

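What I need to do next is create additional vectors (one for each stat I want to compute), so that each observation of such a vector is computed from its own subset of my event data, defined as follows: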

e.2015.1990$GAME.ID = k.2015.1990$GAME.ID
e.2015.1990$PLAYER.ID = k.2015.1990$PLAYER.ID

and then compute the stat from that subset.
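The subset definition above can be sketched literally, one row of the starter table at a time. This is only an illustration on made-up toy data: `e`, `k`, and `events_for` are hypothetical names, and using BATTER as the player column of the event data is an assumption.

```r
# Hypothetical toy stand-ins for e.2015.1990 and k.2015.1990 (made-up values)
e <- data.frame(
  GAME.ID   = c("G1", "G1", "G1", "G2"),
  BATTER    = c("a",  "a",  "b",  "a"),
  HIT.VALUE = c(2L,   1L,   3L,   2L),
  stringsAsFactors = FALSE
)
k <- data.frame(
  GAME.ID   = c("G1", "G1", "G2"),
  PLAYER.ID = c("a",  "b",  "a"),
  stringsAsFactors = FALSE
)

# Rows of the event data that belong to row i of k
# (assumption: BATTER is the player column to match on)
events_for <- function(i) {
  e[e$GAME.ID == k$GAME.ID[i] & e$BATTER == k$PLAYER.ID[i], ]
}

# Number of event rows in each player-game subset
n_events <- vapply(seq_len(nrow(k)), function(i) nrow(events_for(i)), integer(1))
print(n_events)
```

On the real data the ID columns are factors with different level sets, so the comparisons would probably need as.character() around each side. A row-by-row scan like this is also likely to be slow with millions of rows.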

When computing HITS from HIT.VALUE, aggregate() seems to work (where, for HIT.VALUE, 1 = no hit, 2 = single, 3 = double, 4 = triple, 5 = home run):

p.hit = aggregate(x = list(HIT = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x > 1))

> str(p.hit)
'data.frame':   1287476 obs. of  3 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
 $ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HIT      : int  0 3 0 1 0 0 1 2 3 1 ...

However, when I adapt this formula to count a specific stat, in particular singles:

p.single = aggregate(x = list(SINGLE = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x = 2))

I get nothing but "2"s.

> str(p.single)
'data.frame':   1287476 obs. of  3 variables:
 $ GAME.ID  : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
 $ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SINGLE   : num  2 2 2 2 2 2 2 2 2 2 ...

The same happens with doubles and 3s, triples and 4s, and home runs and 5s.
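A likely cause of the constant 2s, shown on a made-up vector: inside a function call, `x = 2` is a named argument rather than a comparison, so sum() is handed the single number 2 and never looks at the data. The elementwise comparison is written `x == 2`. The variable names below are hypothetical.

```r
# Made-up HIT.VALUE values: 1 = no hit, 2 = single, 3 = double
x <- c(1L, 2L, 2L, 2L, 3L, 1L)

# `x = 2` passes the number 2 to sum() as an argument named x;
# the vector x defined above is not used at all
named_arg <- sum(x = 2)

# `x == 2` compares every element to 2; summing the logicals counts the TRUEs
single_count <- sum(x == 2)

print(named_arg)
print(single_count)
```

Here `named_arg` is 2 regardless of the data, while `single_count` is the number of singles in the vector.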

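I think there should be a way to compute a vector so that each observation takes the GAME.ID and PLAYER.ID entries on its row, searches the event file data to isolate the observations where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts the observations in that subset where HIT.VALUE = 2 (or = 3 for doubles, = 4 for triples, or = 5 for home runs), and then returns that count to the observation. In Excel, this could be done with the CountIf() function, which I could easily copy down the length of the vector. However, I don't know how to do this in R.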
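One way the CountIf-style count described above might look in R, as a minimal sketch on made-up toy data: `events`, `starters`, `counts`, and `result` are hypothetical stand-ins, and the approach (a single aggregate() over logical columns, followed by merge()) is one option rather than the only one.

```r
# Hypothetical toy stand-in for e.2015.1990 (values are made up)
events <- data.frame(
  GAME.ID   = c("G1", "G1", "G1", "G1", "G2", "G2", "G2"),
  BATTER    = c("a",  "a",  "b",  "b",  "a",  "b",  "b"),
  HIT.VALUE = c(2L,   2L,   1L,   3L,   2L,   5L,   2L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for k.2015.1990: one row per player per game
starters <- data.frame(
  GAME.ID   = c("G1", "G1", "G2", "G2"),
  PLAYER.ID = c("a",  "b",  "a",  "b"),
  stringsAsFactors = FALSE
)

# Each logical column is TRUE where the event is of that hit type;
# summing within each GAME.ID / BATTER group counts the events
counts <- aggregate(
  x = list(
    SINGLE = events$HIT.VALUE == 2,
    DOUBLE = events$HIT.VALUE == 3,
    TRIPLE = events$HIT.VALUE == 4,
    HR     = events$HIT.VALUE == 5
  ),
  by = list(GAME.ID = events$GAME.ID, PLAYER.ID = events$BATTER),
  FUN = sum
)

# Attach the counts to the starter rows; a starter with no events gets NA
result <- merge(starters, counts, by = c("GAME.ID", "PLAYER.ID"), all.x = TRUE)
print(result)
```

Grouping once and merging avoids scanning the event data separately for every row of the starter table, which matters at several million events.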

0 answers:

No answers