R:如何从另外两个数据帧创建新数据帧

时间:2016-10-26 18:44:40

标签: r dataframe

我有两个包含相关数据的数据框。这与NFL有关。一个df有玩家名字和按周接收目标(玩家df):

           Player  Tm Position  1 2  3  4 5  6
1      A.J. Green CIN       WR 13 8 11 12 8 10
2 Aaron Burbridge SFO       WR  0 1  0  2 0  0
3 Aaron Ripkowski GNB       RB  0 0  0  0 0  1
4  Adam Humphries TAM       WR  5 8 12  4 2  0
5    Adam Thielen MIN       WR  5 5  4  3 8  0
6 Adrian Peterson MIN       RB  2 3  0  0 0  0

另一个数据框有每周由团队总结的接收目标(团队df):

       Tm   `1`   `2`   `3`   `4`   `5`   `6`
   <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     ARI    37    35    50    45    26    35
2     ATL    38    34    30    37    28    41
3     BAL    32    45    40    51    47    48
4     BUF    22    30    20    33    20    26
5     CAR    31    39    36    47    28    46
6     CHI    28    29    45    36    41    49
7     CIN    30    54    28    31    39    31
8     CLE    26    33    38    38    35    42
9     DAL    43    30    24    32    24    27
10    DEN    26    32    35    31    34    47
# ... with 22 more rows

我要做的是按周创建另一个包含目标百分比的数据框。所以我需要匹配来自&#34; Tm&#34;播放器df中的列和周列标题(1-6)。

我已经弄清楚如何通过合并它们然后创建新行来实现这一点,但是当我添加更多数据(周)时,我需要编写更多代码:

a <- merge(playertgt, teamtgt, by="Tm") #merges the two 
    a$Wk1 <- a$`1.x` / a$`1.y`
    a$Wk2 <- a$`2.x` / a$`2.y`
    a$Wk3 <- a$`3.x` / a$`3.y`

所以我正在寻找的是一个很好的方法,这样做会自动更新,并且不会让我必须创建一个df,其中包含一些我不需要的列,并且会更新新的一周,我将它们添加到我的源数据中。

如果在其他地方得到了回答,我道歉,但我一直在寻找一个好方法来做这一天,我找不到它。在此先感谢您的帮助!

2 个答案:

答案 0 :(得分:2)

您可以使用dplyr

执行此操作
library(dplyr)
## Do a left outer join to match each player with total team targets
a <- left_join(playertgt,teamtgt, by="Tm")
## Compute percentage over all weeks selecting player columns ending with ".x"
## and dividing by corresponding team columns ending with ".y"
tgt.pct <- select(a,ends_with(".x")) / select(a,ends_with(".y"))
## set the column names to week + number
colnames(tgt.pct) <- paste0("week",seq_len(ncol(teamtgt)-1))
## construct the output data frame adding back the player and team columns
tgt.pct <- data.frame(Player=playertgt$Player,Tm=playertgt$Tm,tgt.pct)

显然,为了方便dplyr选择加入后的列,我只使用ends_with。使用grepl进行此选择的base-R方法是:

a <- merge(playertgt, teamtgt, by="Tm", all.x=TRUE)
tgt.pct <- subset(a,select=grepl(".x$",colnames(a))) / subset(a,select=grepl(".y$",colnames(a)))
colnames(tgt.pct) <- paste0("week",seq_len(ncol(teamtgt)-1))
tgt.pct <- data.frame(Player=playertgt$Player,Tm=playertgt$Tm,tgt.pct)

数据:使用有限的发布数据,只有AJ Green会计算出目标百分比:

playertgt <- structure(list(Player = structure(1:6, .Label = c("A.J. Green", 
"Aaron Burbridge", "Aaron Ripkowski", "Adam Humphries", "Adam Thielen", 
"Adrian Peterson"), class = "factor"), Tm = structure(c(1L, 4L, 
2L, 5L, 3L, 3L), .Label = c("CIN", "GNB", "MIN", "SFO", "TAM"
), class = "factor"), Position = structure(c(2L, 2L, 1L, 2L, 
2L, 1L), .Label = c("RB", "WR"), class = "factor"), X1 = c(13L, 
0L, 0L, 5L, 5L, 2L), X2 = c(8L, 1L, 0L, 8L, 5L, 3L), X3 = c(11L, 
0L, 0L, 12L, 4L, 0L), X4 = c(12L, 2L, 0L, 4L, 3L, 0L), X5 = c(8L, 
0L, 0L, 2L, 8L, 0L), X6 = c(10L, 0L, 1L, 0L, 0L, 0L)), .Names = c("Player", 
"Tm", "Position", "X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA, 
-6L))
##           Player  Tm Position X1 X2 X3 X4 X5 X6
##1      A.J. Green CIN       WR 13  8 11 12  8 10
##2 Aaron Burbridge SFO       WR  0  1  0  2  0  0
##3 Aaron Ripkowski GNB       RB  0  0  0  0  0  1
##4  Adam Humphries TAM       WR  5  8 12  4  2  0
##5    Adam Thielen MIN       WR  5  5  4  3  8  0
##6 Adrian Peterson MIN       RB  2  3  0  0  0  0

teamtgt <- structure(list(Tm = structure(1:10, .Label = c("ARI", "ATL", 
"BAL", "BUF", "CAR", "CHI", "CIN", "CLE", "DAL", "DEN"), class = "factor"), 
    X1 = c(37L, 38L, 32L, 22L, 31L, 28L, 30L, 26L, 43L, 26L), 
    X2 = c(35L, 34L, 45L, 30L, 39L, 29L, 54L, 33L, 30L, 32L), 
    X3 = c(50L, 30L, 40L, 20L, 36L, 45L, 28L, 38L, 24L, 35L), 
    X4 = c(45L, 37L, 51L, 33L, 47L, 36L, 31L, 38L, 32L, 31L), 
    X5 = c(26L, 28L, 47L, 20L, 28L, 41L, 39L, 35L, 24L, 34L), 
    X6 = c(35L, 41L, 48L, 26L, 46L, 49L, 31L, 42L, 27L, 47L)), .Names = c("Tm", 
"X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA, 
-10L))
##    Tm X1 X2 X3 X4 X5 X6
##1  ARI 37 35 50 45 26 35
##2  ATL 38 34 30 37 28 41
##3  BAL 32 45 40 51 47 48
##4  BUF 22 30 20 33 20 26
##5  CAR 31 39 36 47 28 46
##6  CHI 28 29 45 36 41 49
##7  CIN 30 54 28 31 39 31
##8  CLE 26 33 38 38 35 42
##9  DAL 43 30 24 32 24 27
##10 DEN 26 32 35 31 34 47

结果是:

##           Player  Tm     week1     week2     week3     week4     week5     week6
##1      A.J. Green CIN 0.4333333 0.1481481 0.3928571 0.3870968 0.2051282 0.3225806
##2 Aaron Burbridge SFO        NA        NA        NA        NA        NA        NA
##3 Aaron Ripkowski GNB        NA        NA        NA        NA        NA        NA
##4  Adam Humphries TAM        NA        NA        NA        NA        NA        NA
##5    Adam Thielen MIN        NA        NA        NA        NA        NA        NA
##6 Adrian Peterson MIN        NA        NA        NA        NA        NA        NA

答案 1 :(得分:2)

如果你下次提供一些数据会更好,这会让生活变得更容易。

我认为重点是您的数据结构。我认为你必须把你的数据放到一个很长的格式(关键字是整齐的数据,我猜)。我编写了一些数据,希望我能正确理解你的问题。

library(tidyr)
library(dplyr)


player_df = data.frame(team = c('ARI', 'BAL', 'BAL', 'CLE', 'CLE'),
                       player =c('A', 'B', 'C', 'D', 'F'), 
                       '1' = floor(runif(5, min=1, max=2)*10),
                       '2' = floor(runif(5, min=1, max=2)*10))
> player_df
  team player X1 X2
1  ARI      A 15 10
2  BAL      B 16 15
3  BAL      C 13 11
4  CLE      D 14 19
5  CLE      F 12 14

team_df = data.frame(team = c('ARI', 'BAL', 'CLE'),
                       '1' = floor(runif(3, min=10, max=20)*20),
                       '2' = floor(runif(3, min=10, max=20)*20))
> team_df
  team  X1  X2
1  ARI 281 205
2  BAL 362 309
3  CLE 323 238

现在,将两个数据帧放入长格式:

player_df = gather(player_df, week, player_value, -team, -player)
team_df = gather(team_df, week, team_value, -team)

> player_df
   team player week player_value
1   ARI      A   X1           15
2   BAL      B   X1           16
3   BAL      C   X1           13
4   CLE      D   X1           14
5   CLE      F   X1           12
6   ARI      A   X2           10
7   BAL      B   X2           15
8   BAL      C   X2           11
9   CLE      D   X2           19
10  CLE      F   X2           14
> team_df
  team week team_value
1  ARI   X1        281
2  BAL   X1        362
3  CLE   X1        323
4  ARI   X2        205
5  BAL   X2        309
6  CLE   X2        238

现在,将它们连接(或合并)在一起。默认情况下,inner_join会加入常用列名。

join_db = inner_join(player_df, team_df)
> join_db
   team player week player_value team_value
1   ARI      A   X1           15        281
2   BAL      B   X1           16        362
3   BAL      C   X1           13        362
4   CLE      D   X1           14        323
5   CLE      F   X1           12        323
6   ARI      A   X2           10        205
7   BAL      B   X2           15        309
8   BAL      C   X2           11        309
9   CLE      D   X2           19        238
10  CLE      F   X2           14        238

我认为在那种格式中你可以做更多。

HTH

的Stefan