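I have event file data from retrosheet.org. The data is in play-by-play format: each observation describes one play within one game of a baseball season (including reference variables for the game, the players, and the play).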
> str(e.2015.1990)
'data.frame': 4813807 obs. of 42 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
$ INNING : num 1 1 1 1 1 1 1 1 1 2 ...
$ BATTING.TEAM : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
$ OUTS : int 0 1 2 2 2 2 0 1 2 0 ...
$ BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ RES.BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ RES.BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ RES.PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ RES.PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ FIRST.RUNNER : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
$ SECOND.RUNNER : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
$ THIRD.RUNNER : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.TEXT : chr "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
$ EVENT.TYPE : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
$ AB.FLAG : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ HIT.VALUE : int 1 1 3 2 2 1 1 1 1 1 ...
$ SH.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SF.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ DOUBLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRIPLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RBI.ON.PLAY : num 0 0 0 1 0 0 0 0 0 0 ...
$ BATTED.BALL.TYPE : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
$ BATTER.DEST : int 0 0 2 1 1 0 0 0 0 0 ...
$ RUNNER.ON.1ST.DEST : int 0 0 0 0 2 1 0 0 0 0 ...
$ RUNNER.ON.2ND.DEST : int 0 0 0 4 0 2 0 0 0 0 ...
$ RUNNER.ON.3RD.DEST : int 0 0 0 0 0 0 0 0 0 0 ...
$ SB.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.NUM : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...
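From this, I want to calculate game totals for each player in each game. I want to build a data frame in which each observation describes one player's performance in one game of the season, so that every player in every game makes up exactly one observation.
I created a new data frame with two columns, GAME.ID and PLAYER.ID, such that each STARTER in each game makes up one observation.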
> str(k.2015.1990)
'data.frame': 1146866 obs. of 2 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...
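What I need to do next is create additional vectors (one for each statistic I want to calculate), such that each observation of a vector selects a unique subset of my event data, defined as follows: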
e.2015.1990$GAME.ID = k.2015.1990$GAME.ID
e.2015.1990$PLAYER.ID = k.2015.1990$PLAYER.ID
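and then calculates the statistic from that subset.
aggregate() seems to work when calculating HITS from HIT.VALUE (where, for HIT.VALUE, 1 = no hit, 2 = single, 3 = double, 4 = triple, 5 = home run):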
p.hit = aggregate(x = list(HIT = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x > 1))
> str(p.hit)
'data.frame': 1287476 obs. of 3 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
$ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HIT : int 0 3 0 1 0 0 1 2 3 1 ...
However, when I adapt this formula to other statistics, specifically singles:
p.single = aggregate(x = list(SINGLE = e.2015.1990$HIT.VALUE), by = list(GAME.ID = e.2015.1990$GAME.ID, PLAYER.ID = e.2015.1990$BATTER), FUN = function(x) sum(x = 2))
I get all "2"s.
> str(p.single)
'data.frame': 1287476 obs. of 3 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 60 61 62 63 253 269 270 373 374 375 ...
$ PLAYER.ID: Factor w/ 5107 levels "abrej003","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SINGLE : num 2 2 2 2 2 2 2 2 2 2 ...
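The same happens with doubles and 3s, triples and 4s, and home runs and 5s.
I think there should be a way to calculate a vector such that each observation references the GAME.ID and PLAYER.ID entries on its row, searches the event file data frame to isolate those observations where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts the number of observations in that subset where HIT.VALUE = 2 (or = 3 for doubles, = 4 for triples, or = 5 for home runs), and then returns that count to the observation. In Excel, this could be done with the CountIf() function, which I could easily copy down the length of the vector. However, I have no idea how to do this in R.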
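A likely explanation for the constant 2s, sketched on toy data rather than the real event file: in `sum(x = 2)` the single `=` is not a comparison. It passes the value 2 to `sum()` as a named argument, so every group returns 2 (and 3, 4, or 5 for the other variants). The comparison operator is `==`. The data frame `ev` below is a hypothetical stand-in with made-up values.

```r
# Toy stand-in for the event data (hypothetical values, not from Retrosheet)
ev <- data.frame(
  GAME.ID   = c("G1", "G1", "G1", "G2", "G2"),
  BATTER    = c("a", "a", "b", "a", "b"),
  HIT.VALUE = c(2L, 2L, 1L, 3L, 2L)
)

# `x = 2` hands the constant 2 to sum() as a named argument,
# so the group's data is ignored and every group yields 2
wrong <- aggregate(x = list(SINGLE = ev$HIT.VALUE),
                   by = list(GAME.ID = ev$GAME.ID, PLAYER.ID = ev$BATTER),
                   FUN = function(x) sum(x = 2))

# `x == 2` is a logical comparison; summing it counts the TRUE values
right <- aggregate(x = list(SINGLE = ev$HIT.VALUE),
                   by = list(GAME.ID = ev$GAME.ID, PLAYER.ID = ev$BATTER),
                   FUN = function(x) sum(x == 2))
```

This also matches why the HITS version behaves as expected: `x > 1` is already a logical comparison.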
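One way to get a CountIf-like result in base R, as a minimal sketch under assumptions: count every hit type per (game, batter) in a single `aggregate()` call, then `merge()` those counts onto the starters table. The toy frames `ev` and `k` are hypothetical stand-ins for `e.2015.1990` and `k.2015.1990`, and the column names SINGLE, DOUBLE, TRIPLE, and HR are chosen for illustration.

```r
# Toy stand-ins (hypothetical values) for the event data and the starters table
ev <- data.frame(
  GAME.ID   = c("G1", "G1", "G1", "G2", "G2"),
  BATTER    = c("a", "a", "b", "a", "b"),
  HIT.VALUE = c(2L, 2L, 1L, 3L, 2L)
)
k <- data.frame(
  GAME.ID   = c("G1", "G1", "G2", "G2"),
  PLAYER.ID = c("a", "b", "a", "c")   # "c" has no events in G2
)

# One row per (game, batter); summing a logical column counts its TRUE values
counts <- aggregate(
  x = data.frame(SINGLE = ev$HIT.VALUE == 2,
                 DOUBLE = ev$HIT.VALUE == 3,
                 TRIPLE = ev$HIT.VALUE == 4,
                 HR     = ev$HIT.VALUE == 5),
  by = list(GAME.ID = ev$GAME.ID, PLAYER.ID = ev$BATTER),
  FUN = sum
)

# Left join: keep every row of k, attach the matching counts
out <- merge(k, counts, by = c("GAME.ID", "PLAYER.ID"), all.x = TRUE)

# Players in k with no matching events get NA from the join; treat that as 0
stat_cols <- c("SINGLE", "DOUBLE", "TRIPLE", "HR")
out[stat_cols][is.na(out[stat_cols])] <- 0
```

Grouping once and joining avoids scanning the event data separately for each row of the starters table, which is what a row-by-row CountIf would amount to. On several million event rows `aggregate()` may still be slow, so this is a starting point rather than a tuned solution.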