I have data tracking a sports team's wins and losses against other teams, structured as follows:
Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss ...
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA
...
I want to create a factor variable containing the team each game was played against, so that the data looks like this:
Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss Team
1 1 0 1 NA NA NA Team1
2 1 1 NA NA NA 1 Team2
3 2 1 NA NA 1 NA Team2
4 2 2 NA 1 NA NA Team1
5 3 2 NA NA 1 NA Team2
...
My idea (non-working code) is basically this:
if (Team1Win == 1 | Team1Loss == 1), Team = "Team1"
if (Team2Win == 1 | Team2Loss == 1), Team = "Team2"
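I'm really struggling with how to do this using mutate in dplyr. I've tried ifelse, recode, and various other approaches, but I keep getting errors or results that aren't what I want.
What is the correct and most efficient way to do this in dplyr?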
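Answer 0 (score: 2)
Similar to the other answers, but with a few useful variations: na.rm = TRUE in gather drops the NA rows directly; sub works just fine, so stringr is not needed; and full_join brings back the complete data, as the question asks.
library(dplyr)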
library(tidyr)
df = read.delim(text =
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA", sep = " ")
df %>%
select(-starts_with("Total")) %>%
gather(Team, one, -Game, na.rm = TRUE) %>%
select(-one) %>%
mutate(Team = sub("Win|Loss", "", Team)) %>%
full_join(df, .)
#> Joining, by = "Game"
#> Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss Team
#> 1 1 1 0 1 NA NA NA Team1
#> 2 2 1 1 NA NA NA 1 Team2
#> 3 3 2 1 NA NA 1 NA Team2
#> 4 4 2 2 NA 1 NA NA Team1
#> 5 5 3 2 NA NA 1 NA Team2
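In current tidyr releases gather is superseded by pivot_longer, so the reshaping step above can also be written as follows. This is only a sketch under the assumption that every team column is named like Team1Win or Team2Loss; the names teams and flag are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- read.delim(text =
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA", sep = " ")

teams <- df %>%
  # keep the game id and the per-team result columns only
  select(Game, starts_with("Team")) %>%
  # one row per recorded result; NA cells are dropped while reshaping
  pivot_longer(-Game, names_to = "Team", values_to = "flag",
               values_drop_na = TRUE) %>%
  # strip the Win/Loss suffix and store the opponent as a factor
  mutate(Team = factor(sub("Win|Loss", "", Team))) %>%
  select(Game, Team)

teams
```

values_drop_na = TRUE plays the same role here as na.rm = TRUE does in gather.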
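Answer 1 (score: 1)
I'm currently a sucker for the dplyr way of doing things, so I've provided a solution using dplyr that extends to however many teams you might have. It also uses tidyr and stringr, as helpfully pointed out in apom's comment below.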
library(dplyr)
library(tidyr)
library(stringr)
library(readr) # read_delim() comes from readr
df = read_delim(
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA",delim = " ")
df %>%
gather("Team",value,contains("Team")) %>%
filter(!is.na(value)) %>%
mutate(Team = str_replace_all(Team,c("Win" = "","Loss" = ""))) %>%
select(-value)
Answer 2 (score: 0)
This may be what you're looking for. (It is not hard-coded for just 2 teams.)
# solution 1: column position -> team number (assumes Win/Loss column pairs in team order)
paste0("Team",ceiling(apply(df[-c(1:3)], 1, function(x) which(!is.na(x)))/2))
# [1] "Team1" "Team2" "Team2" "Team1" "Team2"
# solution 2: using apply() (basically a for loop itself)
# "\\d+" keeps multi-digit team numbers such as Team10 intact
apply(df[-c(1:3)], 1, function(x) gsub("(Team\\d+).*", "\\1", colnames(df[-c(1:3)])[which(!is.na(x))]))
# [1] "Team1" "Team2" "Team2" "Team1" "Team2"
# solution 3: (long route to dplyr) [ you have indirectly taught me a lot in dplyr through my search for this solution]
func <- function(x){
y = which(x == 1) # get the location of where 1 appears
z = rep(0, times = length(x)) # create a vector of 0's+location of 1
z[y] = y # i.e. c(0,0,3,0,5) for Team2Win
z
}
df1 = df[-c(1:3)] %>% gather("key", "value", starts_with("Team")) %>%
group_by(key) %>%
dplyr::mutate(x = func(value)) %>%
filter(x != 0) %>% arrange(x) %>% select(key)
df$newcol = gsub("(Team\\d+).*", "\\1", df1$key)
Answer 3 (score: 0)
You can do it with a simple loop:
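# only look at the per-team columns; Game and the Total* columns can also equal 1
x = grep("^Team\\d+(Win|Loss)$", colnames(df), value = TRUE)
df$team<- NA
for (i in seq_len(nrow(df)))
{
df$team[i] = x[which(df[i, x]==1)]
}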
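Then at the end you can trim off "Win" and "Loss" with the following (the pattern is case-sensitive, so it has to match the column names exactly):
df$team<- gsub("Win", "",df$team)
df$team<- gsub("Loss", "",df$team)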
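The same lookup can be done without an explicit loop. The following is a base R sketch, not part of the original answer, and it assumes that every row has exactly one non-NA team column and that those columns are named like Team1Win or Team2Loss; team_cols, flags and idx are illustrative names.

```r
df <- read.delim(text =
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA", sep = " ")

# per-team result columns (naming pattern assumed from the question)
team_cols <- grep("^Team\\d+(Win|Loss)$", colnames(df), value = TRUE)

# numeric matrix: 1 where a result was recorded, 0 where the cell is NA
flags <- (!is.na(as.matrix(df[team_cols]))) * 1

# position of the recorded result in each row
idx <- max.col(flags, ties.method = "first")

# drop the Win/Loss suffix and store the opponent as a factor
df$team <- factor(sub("(Win|Loss)$", "", team_cols[idx]))
```

If a row could have no result or several results recorded, max.col would still return a single column, so that case would need checking separately.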
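Answer 4 (score: 0)
I'm fairly sure your data has more than two teams and that the team names aren't generic. What you want to do is first reshape the data to long format and then extract the relevant team names. So you may want to follow these steps.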
library(dplyr)
library(tidyr)
new_df <- df %>%
gather(team,idx,Team1Win:Team100Loss) %>%
filter(!is.na(idx)) %>%
select(-idx) %>%
mutate(team = gsub("Win|Loss","",team))
If you want to keep those wide columns, you can join the new data frame back onto the old one.
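A minimal sketch of that join, using the sample data from the question. The matches() selector stands in for the hard-coded column range and assumes the TeamNWin / TeamNLoss naming; wide_df is an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- read.delim(text =
"Game TotalWins TotalLosses Team1Win Team1Loss Team2Win Team2Loss
1 1 0 1 NA NA NA
2 1 1 NA NA NA 1
3 2 1 NA NA 1 NA
4 2 2 NA 1 NA NA
5 3 2 NA NA 1 NA", sep = " ")

new_df <- df %>%
  # reshape only the per-team result columns to long form
  gather(team, idx, matches("^Team\\d+(Win|Loss)$")) %>%
  filter(!is.na(idx)) %>%
  # keep just the key and the new variable so the join adds one column
  select(Game, team) %>%
  mutate(team = gsub("Win|Loss", "", team))

# attach the opponent to the original wide data, matching on Game
wide_df <- left_join(df, new_df, by = "Game")
```

Joining on Game alone avoids duplicating TotalWins and TotalLosses in the result.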