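I have event file data from retrosheet.org. It is baseball data in play-by-play format: each observation describes one play in one game of a baseball season (including reference variables for the game, the players, and the play).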
> str(e.2015.1990)
'data.frame': 4813807 obs. of 42 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 1 1 1 1 1 1 1 1 1 ...
$ INNING : num 1 1 1 1 1 1 1 1 1 2 ...
$ BATTING.TEAM : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 1 ...
$ OUTS : int 0 1 2 2 2 2 0 1 2 0 ...
$ BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ RES.BATTER : Factor w/ 5107 levels "abrej003","ackld001",..: 73 167 33 120 163 100 34 256 200 209 ...
$ RES.BATTER.HAND : Factor w/ 2 levels "L","R": 2 1 2 1 2 1 1 2 2 2 ...
$ PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ RES.PITCHER : Factor w/ 3481 levels "abadf001","albem001",..: 187 187 187 187 187 187 204 204 204 187 ...
$ RES.PITCHER.HAND : Factor w/ 2 levels "L","R": 1 1 1 1 1 1 1 1 1 1 ...
$ FIRST.RUNNER : Factor w/ 4369 levels "","abrej003",..: 1 1 1 1 104 140 1 1 1 1 ...
$ SECOND.RUNNER : Factor w/ 4048 levels "","abrej003",..: 1 1 1 26 1 90 1 1 1 1 ...
$ THIRD.RUNNER : Factor w/ 3729 levels "","ackld001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.TEXT : chr "63/G" "6/P" "D8/L+" "S9/G.2-H" ...
$ EVENT.TYPE : Factor w/ 21 levels "2","3","4","5",..: 1 1 19 18 18 1 1 1 1 1 ...
$ AB.FLAG : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ HIT.VALUE : int 1 1 3 2 2 1 1 1 1 1 ...
$ SH.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SF.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ DOUBLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ TRIPLE.PLAY.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RBI.ON.PLAY : num 0 0 0 1 0 0 0 0 0 0 ...
$ BATTED.BALL.TYPE : Factor w/ 5 levels "","F","G","L",..: 3 5 4 3 4 5 3 3 5 4 ...
$ BATTER.DEST : int 0 0 2 1 1 0 0 0 0 0 ...
$ RUNNER.ON.1ST.DEST : int 0 0 0 0 2 1 0 0 0 0 ...
$ RUNNER.ON.2ND.DEST : int 0 0 0 4 0 2 0 0 0 0 ...
$ RUNNER.ON.3RD.DEST : int 0 0 0 0 0 0 0 0 0 0 ...
$ SB.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ SB.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ CS.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.1ST.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.2ND.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ PO.FOR.RUNNER.ON.3RD.FLAG : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.1ST: Factor w/ 3433 levels "","albua001",..: 1 1 1 1 161 161 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.2ND: Factor w/ 3408 levels "","abadf001",..: 1 1 1 133 1 133 1 1 1 1 ...
$ RESPONSIBLE.PITCHER.FOR.RUNNER.ON.3RD: Factor w/ 3337 levels "","abadf001",..: 1 1 1 1 1 1 1 1 1 1 ...
$ EVENT.NUM : Factor w/ 177 levels "1","10","100",..: 1 90 101 112 123 134 145 156 167 2 ...
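From this, I want to compute per-game totals for every player. I want to build a data frame in which each observation describes one player's performance in one game of the season, so that every player in every game makes up one observation.
I created a new data frame with two columns, GAME.ID and PLAYER.ID, so that every STARTER in every game makes up one observation.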
> str(k.2015.1990)
'data.frame': 1146866 obs. of 2 variables:
$ GAME.ID : Factor w/ 60464 levels "ANA201504100",..: 1 2 3 4 5 6 7 8 9 10 ...
$ PLAYER.ID: Factor w/ 4699 levels "altuj001","bettm001",..: 11 11 11 12 14 12 12 24 24 24 ...
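I think what I need to do next is create additional vectors (one for each stat I want to compute), such that each observation of such a vector picks out a unique subset of my event data, defined as follows: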
e.2015.1990$GAME.ID = k.2015.1990$GAME.ID
e.2015.1990$PLAYER.ID = k.2015.1990$PLAYER.ID
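and then compute the stat from that subset. I know how to create vectors and subsets in R, but not a vector that builds a unique subset for each observation. I think I need to use
function(x)
to do this; however, I am new to R and have no experience with it.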
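To make this easier, I will try to give a reproducible example. In this example, the goal is to compute the hit total for each player in the first two games of the Angels' 2015 regular season.
I made a subset of the event file data containing the 156 observations that correspond to those two games. For simplicity, I included only the variables GAME.ID, BATTER, and HIT.VALUE.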
GAME.ID BATTER HIT.VALUE
1 ANA201504100 escoa003 1
2 ANA201504100 mousm001 1
3 ANA201504100 cainl001 3
4 ANA201504100 hosme001 2
5 ANA201504100 morak001 2
6 ANA201504100 gorda001 1
7 ANA201504100 calhk001 1
8 ANA201504100 troum001 1
9 ANA201504100 pujoa001 1
10 ANA201504100 riosa002 1
11 ANA201504100 peres002 1
12 ANA201504100 infao001 1
13 ANA201504100 freed001 1
14 ANA201504100 cronc002 1
15 ANA201504100 aybae001 1
16 ANA201504100 escoa003 1
17 ANA201504100 mousm001 1
18 ANA201504100 cainl001 1
19 ANA201504100 hosme001 1
20 ANA201504100 morak001 1
21 ANA201504100 iannc001 1
22 ANA201504100 cowgc001 2
23 ANA201504100 giavj001 1
24 ANA201504100 calhk001 3
25 ANA201504100 troum001 1
26 ANA201504100 pujoa001 1
27 ANA201504100 gorda001 1
28 ANA201504100 riosa002 1
29 ANA201504100 peres002 1
30 ANA201504100 freed001 2
31 ANA201504100 cronc002 1
32 ANA201504100 aybae001 1
33 ANA201504100 iannc001 1
34 ANA201504100 infao001 1
35 ANA201504100 escoa003 2
36 ANA201504100 mousm001 1
37 ANA201504100 cainl001 2
38 ANA201504100 hosme001 1
39 ANA201504100 cowgc001 1
40 ANA201504100 giavj001 1
41 ANA201504100 calhk001 1
42 ANA201504100 morak001 5
43 ANA201504100 gorda001 1
44 ANA201504100 riosa002 1
45 ANA201504100 peres002 1
46 ANA201504100 troum001 2
47 ANA201504100 pujoa001 1
48 ANA201504100 freed001 5
49 ANA201504100 cronc002 1
50 ANA201504100 infao001 1
51 ANA201504100 escoa003 1
52 ANA201504100 mousm001 2
53 ANA201504100 cainl001 1
54 ANA201504100 cainl001 1
55 ANA201504100 aybae001 1
56 ANA201504100 iannc001 1
57 ANA201504100 joycm001 3
58 ANA201504100 giavj001 1
59 ANA201504100 hosme001 1
60 ANA201504100 morak001 1
61 ANA201504100 gorda001 1
62 ANA201504100 riosa002 1
63 ANA201504100 riosa002 1
64 ANA201504100 calhk001 1
65 ANA201504100 troum001 2
66 ANA201504100 pujoa001 1
67 ANA201504100 freed001 1
68 ANA201504100 peres002 2
69 ANA201504100 infao001 2
70 ANA201504100 escoa003 1
71 ANA201504100 mousm001 1
72 ANA201504100 cainl001 1
73 ANA201504100 hosme001 1
74 ANA201504100 morak001 1
75 ANA201504100 cronc002 1
76 ANA201504100 aybae001 1
77 ANA201504100 iannc001 1
78 ANA201504100 joycm001 1
79 ANA201504110 escoa003 1
80 ANA201504110 mousm001 1
81 ANA201504110 cainl001 1
82 ANA201504110 hosme001 1
83 ANA201504110 calhk001 5
84 ANA201504110 troum001 2
85 ANA201504110 pujoa001 1
86 ANA201504110 joycm001 1
87 ANA201504110 freed001 1
88 ANA201504110 morak001 1
89 ANA201504110 gorda001 1
90 ANA201504110 riosa002 1
91 ANA201504110 aybae001 2
92 ANA201504110 navae001 1
93 ANA201504110 buted001 1
94 ANA201504110 giavj001 1
95 ANA201504110 peres002 1
96 ANA201504110 infao001 1
97 ANA201504110 escoa003 1
98 ANA201504110 giavj001 1
99 ANA201504110 calhk001 1
100 ANA201504110 troum001 1
101 ANA201504110 mousm001 5
102 ANA201504110 cainl001 2
103 ANA201504110 hosme001 1
104 ANA201504110 hosme001 1
105 ANA201504110 morak001 3
106 ANA201504110 gorda001 1
107 ANA201504110 riosa002 2
108 ANA201504110 peres002 5
109 ANA201504110 infao001 2
110 ANA201504110 escoa003 1
111 ANA201504110 pujoa001 1
112 ANA201504110 joycm001 1
113 ANA201504110 freed001 1
114 ANA201504110 mousm001 1
115 ANA201504110 cainl001 1
116 ANA201504110 hosme001 2
117 ANA201504110 morak001 2
118 ANA201504110 gorda001 1
119 ANA201504110 riosa002 1
120 ANA201504110 aybae001 1
121 ANA201504110 navae001 1
122 ANA201504110 buted001 2
123 ANA201504110 giavj001 1
124 ANA201504110 calhk001 3
125 ANA201504110 troum001 2
126 ANA201504110 pujoa001 1
127 ANA201504110 riosa002 1
128 ANA201504110 peres002 2
129 ANA201504110 infao001 1
130 ANA201504110 escoa003 2
131 ANA201504110 mousm001 1
132 ANA201504110 joycm001 1
133 ANA201504110 freed001 1
134 ANA201504110 aybae001 1
135 ANA201504110 cainl001 1
136 ANA201504110 hosme001 1
137 ANA201504110 morak001 2
138 ANA201504110 gorda001 1
139 ANA201504110 riosa002 1
140 ANA201504110 navae001 1
141 ANA201504110 iannc001 1
142 ANA201504110 giavj001 1
143 ANA201504110 peres002 1
144 ANA201504110 infao001 1
145 ANA201504110 escoa003 1
146 ANA201504110 calhk001 1
147 ANA201504110 troum001 1
148 ANA201504110 pujoa001 1
149 ANA201504110 mousm001 2
150 ANA201504110 cainl001 1
151 ANA201504110 hosme001 1
152 ANA201504110 morak001 1
153 ANA201504110 gorda001 1
154 ANA201504110 joycm001 1
155 ANA201504110 freed001 1
156 ANA201504110 aybae001 1
I also made a subset of the new data frame, corresponding to the 40 starters in those two games.
GAME.ID PLAYER.ID
1 ANA201504100 escoa003
60465 ANA201504100 mousm001
120929 ANA201504100 cainl001
181393 ANA201504100 hosme001
241857 ANA201504100 morak001
302321 ANA201504100 gorda001
362785 ANA201504100 riosa002
423249 ANA201504100 peres002
483713 ANA201504100 infao001
1117610 ANA201504100 vargj001
573434 ANA201504100 calhk001
633898 ANA201504100 troum001
694362 ANA201504100 pujoa001
754826 ANA201504100 freed001
815290 ANA201504100 cronc002
875754 ANA201504100 aybae001
936218 ANA201504100 iannc001
996682 ANA201504100 cowgc001
1057146 ANA201504100 giavj001
1117613 ANA201504100 santh001
2 ANA201504110 escoa003
60466 ANA201504110 mousm001
120930 ANA201504110 cainl001
181394 ANA201504110 hosme001
241858 ANA201504110 morak001
302322 ANA201504110 gorda001
362786 ANA201504110 riosa002
423250 ANA201504110 peres002
483714 ANA201504110 infao001
2100000 ANA201504110 guthj001
573435 ANA201504110 calhk001
633899 ANA201504110 troum001
694363 ANA201504110 pujoa001
754827 ANA201504110 joycm001
815291 ANA201504110 freed001
875755 ANA201504110 aybae001
936219 ANA201504110 navae001
996683 ANA201504110 buted001
1057147 ANA201504110 giavj001
2100001 ANA201504110 weavj003
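I think there should be a way to add a column to the latter data frame so that each observation takes the GAME.ID and PLAYER.ID entries on its own row, searches the former data frame for the rows where GAME.ID = GAME.ID and PLAYER.ID = BATTER, counts the observations in that subset where HIT.VALUE > 1 (1 = default, 2 = single, 3 = double, 4 = triple, 5 = home run), and returns that count to the observation. In Excel this can be done with the CountIf() function, which I could easily copy down the length of the vector. However, I do not know how to do this in R.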
Answer 0 (score: 0)
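I think this may be what you are looking for. It groups the second-to-last dataset (the event subset) by GAME.ID and BATTER, then counts the rows with HIT.VALUE > 1 in each group.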
library(data.table)
# df is assumed to be the event subset from the question (GAME.ID, BATTER, HIT.VALUE)
dt <- setDT(df)[, list(count_hits = sum(HIT.VALUE > 1)), by = c("GAME.ID", "BATTER")]
head(dt)
GAME.ID BATTER count_hits
1: ANA201504100 escoa003 1
2: ANA201504100 mousm001 1
3: ANA201504100 cainl001 2
4: ANA201504100 hosme001 1
5: ANA201504100 morak001 2
6: ANA201504100 gorda001 0
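As a sanity check on one group, the count can also be reproduced by hand, which is roughly what a single Excel CountIf() cell would do. The sketch below uses base R only and rebuilds just a few rows of the question's event subset (the rows for two batters in game ANA201504100). The name `df_small` is made up for this illustration and is not part of the original data.

```r
# A few rows copied from the question's event subset (game ANA201504100 only).
# df_small is a hypothetical name used just for this check.
df_small <- data.frame(
  GAME.ID   = "ANA201504100",
  BATTER    = c(rep("escoa003", 5), rep("cainl001", 6)),
  HIT.VALUE = c(1, 1, 2, 1, 1,   3, 1, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Select the rows for one game and one batter,
# then count how many of them have HIT.VALUE > 1
in_group  <- df_small$GAME.ID == "ANA201504100" & df_small$BATTER == "cainl001"
cain_hits <- sum(df_small$HIT.VALUE[in_group] > 1)
cain_hits
```

This should give 2 for cainl001, matching row 3 of the output above. The grouped call does the same thing once per (GAME.ID, BATTER) pair.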
Another option, in base R:
res <- aggregate(x = list(count_hits = df$HIT.VALUE),
                 by = list(GAME.ID = df$GAME.ID, BATTER = df$BATTER),
                 FUN = function(x) sum(x > 1))
head(res)
GAME.ID BATTER count_hits
1 ANA201504100 aybae001 0
2 ANA201504110 aybae001 1
3 ANA201504110 buted001 1
4 ANA201504100 cainl001 2
5 ANA201504110 cainl001 1
6 ANA201504100 calhk001 1
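Neither result above is attached to the starters table from the question yet. One possible way to do that, sketched here in base R, is a left join with merge(), followed by replacing the NA counts with 0 for starters who never appear as a batter (vargj001, for example, is listed as a starter but has no rows in the event subset). The objects `res_small`, `k_small`, and `out` are hypothetical stand-ins holding a few rows copied from the output and the starters subset above. Treating a missing count as 0 is an assumption about what the stat should mean.

```r
# Stand-in for a few rows of the aggregated counts (values from the output above)
res_small <- data.frame(
  GAME.ID    = c("ANA201504100", "ANA201504110", "ANA201504100"),
  BATTER     = c("aybae001", "aybae001", "cainl001"),
  count_hits = c(0, 1, 2),
  stringsAsFactors = FALSE
)

# Stand-in for a few rows of the starters table (vargj001 has no batting rows)
k_small <- data.frame(
  GAME.ID   = c("ANA201504100", "ANA201504100", "ANA201504100", "ANA201504110"),
  PLAYER.ID = c("aybae001", "cainl001", "vargj001", "aybae001"),
  stringsAsFactors = FALSE
)

# Left join: keep every starter row, matching PLAYER.ID against BATTER
out <- merge(k_small, res_small,
             by.x = c("GAME.ID", "PLAYER.ID"),
             by.y = c("GAME.ID", "BATTER"),
             all.x = TRUE)

# Starters with no matching batting rows come back as NA; treat them as 0 here
out$count_hits[is.na(out$count_hits)] <- 0
out
```

With all.x = TRUE every starter row is kept, so the result has one row per game and starter, which is the shape the question asks for. Further stats could be added the same way, one aggregated column at a time.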