How do I assign a unique identifier to each unique set of data frame values across different column groups?

Asked: 2017-07-10 08:26:39

Tags: r

I have a data frame similar to this:

teamAPlayer1    teamAPlayer2    teamBPlayer1    teamBPlayer2
Jack            Jill            Matt            Megan
Jill            Jack            Megan           Matt
Megan           Jill            Matt            Jack
Megan           Matt            Jill            Jack
Megan           Jack            Jill            Matt

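My goal is to assign a unique ID to each distinct team roster, regardless of the order in which the players are listed and regardless of whether they are on team A or team B. For the example above, I would like to add the following two columns to my data frame: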

teamAPlayer1    teamAPlayer2    teamAID    teamBPlayer1    teamBPlayer2    teamBID
Jack            Jill            1          Matt            Megan           2
Jill            Jack            1          Megan           Matt            2
Megan           Jill            3          Matt            Jack            4
Megan           Matt            2          Jill            Jack            1
Jack            Matt            4          Jill            Megan           3

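I could write a solution that indexes with for/while loops, but I am working with a very large data frame in which each team has 5 players rather than 2, so such a script takes a long time to run. Is it possible to solve this with a vectorized approach?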

4 answers:

Answer 0 (score: 0)

Your data

df <- data.frame(teamAPlayer1=c("Jack","Jill","Megan","Megan","Megan"),
                 teamAPlayer2=c("Jill","Jack","Jill","Matt","Jack"),
                 teamBPlayer1=c("Matt","Megan","Matt","Jill","Jill"),
                 teamBPlayer2=c("Megan","Matt","Jack","Jack","Matt"),
                 stringsAsFactors=FALSE)

Build a vector of unique player names

library(dplyr)

# Grab all unique player names and assign each a number
unique.id <- seq_along(unique(unlist(df)))
names(unique.id) <- unique(unlist(df))

# Paste the sorted player numbers of each team into new columns
df1 <- df %>%
   rowwise() %>%
   mutate(teamApairs=paste0(sort(c(unique.id[teamAPlayer1],unique.id[teamAPlayer2])),collapse=" ")) %>%
   mutate(teamBpairs=paste0(sort(c(unique.id[teamBPlayer1],unique.id[teamBPlayer2])),collapse=" ")) %>%
   ungroup()

Build a vector of unique player pairs

# Grab all unique player pairs - assign to each a unique number
unique.pairs <- seq(1, length(unique(unlist(df1[,5:6]))), 1)
names(unique.pairs) <- unique(unlist(df1[,5:6]))

# Factorize unique player pairs as unique number
df2 <- df1 %>%
       mutate(teamAID=unique.pairs[teamApairs]) %>%
       mutate(teamBID=unique.pairs[teamBpairs]) %>%
       select(-teamApairs,-teamBpairs)

Output

  teamAPlayer1 teamAPlayer2 teamBPlayer1 teamBPlayer2 teamAID teamBID
1         Jack         Jill         Matt        Megan       1       3
2         Jill         Jack        Megan         Matt       1       3
3        Megan         Jill         Matt         Jack       2       5
4        Megan         Matt         Jill         Jack       3       1
5        Megan         Jack         Jill         Matt       4       6

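Answer 1 (score: 0)

Your expected output does not match your input (see the last row), but I think this does what you want:

library(magrittr)    # %>%, %<>%, set_colnames
library(data.table)  # as.data.table, :=, .GRP

df <- read.table(text="teamAPlayer1    teamAPlayer2    teamBPlayer1    teamBPlayer2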
Jack            Jill            Matt            Megan
Jill            Jack            Megan           Matt
Megan           Jill            Matt            Jack
Megan           Matt            Jill            Jack
Megan           Jack            Jill            Matt",stringsAsFactors=FALSE,header=TRUE)

dt_concat <- matrix(unlist(t(df)),ncol=2,byrow=TRUE) %>% # create a two column matrix with team compositions
  cbind(.,team = apply(.,1,. %>% sort %>% paste(collapse=" "))) %>% as.data.table # add column with sorted team members in a string
dt_concat[, teamID := .GRP, by = team] # attribute ids
df %<>% cbind(dt_concat$teamID %>% matrix(ncol=2,byrow=TRUE) %>% set_colnames(c("teamAID","teamBID"))) # add ids to original df

#   teamAPlayer1 teamAPlayer2 teamBPlayer1 teamBPlayer2 teamAID teamBID
# 1         Jack         Jill         Matt        Megan       1       2
# 2         Jill         Jack        Megan         Matt       1       2
# 3        Megan         Jill         Matt         Jack       3       4
# 4        Megan         Matt         Jill         Jack       2       1
# 5        Megan         Jack         Jill         Matt       5       6
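
The IDs printed here differ from those in Answer 0's output, but only in how the rosters are numbered: Answer 0 numbers them while reading all of team A before team B, whereas this answer reads row by row. A minimal sketch to confirm that both outputs group the rosters identically, with the IDs copied by hand from the two outputs (relabel is a hypothetical helper name, not part of either answer):

```r
# IDs copied from the two printed outputs, team A column followed by team B.
ids_answer0 <- c(1, 1, 2, 3, 4,  3, 3, 5, 1, 6)
ids_answer1 <- c(1, 1, 3, 2, 5,  2, 2, 4, 1, 6)

# relabel() is a hypothetical helper: renumber IDs by order of first appearance,
# so two labelings of the same grouping become identical vectors.
relabel <- function(ids) match(ids, unique(ids))

same_grouping <- identical(relabel(ids_answer0), relabel(ids_answer1))
print(same_grouping)
```

If the two labelings described different groupings, the relabeled vectors would differ.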

Answer 2 (score: 0)

Here is a simple solution using pmin and pmax:
v1 <- paste(do.call(pmin, df[c(1:2)]), do.call(pmax, df[c(1:2)]))
v2 <- paste(do.call(pmin, df[c(3:4)]), do.call(pmax, df[c(3:4)]))
v3 <- unique(c(rbind(v1, v2)))

teamAID <- match(v1, v3)
#[1] 1 1 3 2 5

teamBID <- match(v2, v3)
#[1] 2 2 4 1 6
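
pmin and pmax order exactly two values, so the code above fits the two-player example but not the five-player case mentioned in the question. One possible generalization, sketched here under the assumption that every team's columns share a name prefix such as teamA or teamB, is to sort each row's players and paste them into a single key (roster_key is a hypothetical helper name, and the space separator assumes player names contain no spaces):

```r
# Same data as in the question (two players per team here).
df <- data.frame(
  teamAPlayer1 = c("Jack", "Jill", "Megan", "Megan", "Megan"),
  teamAPlayer2 = c("Jill", "Jack", "Jill", "Matt", "Jack"),
  teamBPlayer1 = c("Matt", "Megan", "Matt", "Jill", "Jill"),
  teamBPlayer2 = c("Megan", "Matt", "Jack", "Jack", "Matt"),
  stringsAsFactors = FALSE
)

# roster_key() is a hypothetical helper: one order-independent string per row.
roster_key <- function(cols) {
  apply(as.matrix(cols), 1, function(x) paste(sort(x), collapse = " "))
}

# Select each team's columns by name prefix, so any number of players works.
# Both keys are computed before the ID columns are added, because "teamAID"
# would otherwise also match the "teamA" prefix.
keyA <- roster_key(df[startsWith(names(df), "teamA")])
keyB <- roster_key(df[startsWith(names(df), "teamB")])

# Number rosters in order of first appearance, reading row by row.
rosters <- unique(c(rbind(keyA, keyB)))
df$teamAID <- match(keyA, rosters)
df$teamBID <- match(keyB, rosters)
```

The apply call still visits every row, so this is not fully vectorized, but the ID assignment itself is a single match per team.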

Answer 3 (score: 0)

Allow me to suggest reshaping your original data entirely.

library(data.table)
library(magrittr)
setDT(df)

df %>%
  .[, Round := 1:.N] %>%
  .[]  # this is only here to view the result

   teamAPlayer1 teamAPlayer2 teamBPlayer1 teamBPlayer2 Round
1:         Jack         Jill         Matt        Megan     1
2:         Jill         Jack        Megan         Matt     2
3:        Megan         Jill         Matt         Jack     3
4:        Megan         Matt         Jill         Jack     4
5:        Megan         Jack         Jill         Matt     5

That is, each row of the original data is identified by Round (the tournament round). You can then reshape the data:

df %>%
  .[, Round := 1:.N] %>%
  melt.data.table(id.vars = "Round",
                  value.name = "participant") %>%
  .[, Event := gsub("team([AB]).*$", "\\1", variable)] %>%
  # Ordering by participant necessary to define
  # distinct combinations  JackJill == JillJack
  .[order(Round, participant, Event)] %>%
  .[,
    .(Team = paste0(participant, collapse = "")),
    keyby = .(Round, Event)]

    Round Event      Team
 1:     1     A  JackJill
 2:     1     B MattMegan
 3:     2     A  JackJill
 4:     2     B MattMegan
 5:     3     A JillMegan
 6:     3     B  JackMatt
 7:     4     A MattMegan
 8:     4     B  JackJill
 9:     5     A JackMegan
10:     5     B  JillMatt

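This format has many advantages. For example, you could add another column, Score, that refers unambiguously to a particular game rather than relying on the order of the columns. However, if you want something closer to the original, you can always dcast: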

df %>%
  .[, Round := 1:.N] %>%
  melt.data.table(id.vars = "Round",
                  value.name = "participant") %>%
  .[, Event := gsub("team([AB]).*$", "\\1", variable)] %>%
  # Ordering by participant necessary to define
  # distinct combinations  JackJill == JillJack
  .[order(Round, participant, Event)] %>%
  .[,
    .(Team = paste0(participant, collapse = "")),
    keyby = .(Round, Event)] %>%
  dcast.data.table(Round ~ Event)

   Round         A         B
1:     1  JackJill MattMegan
2:     2  JackJill MattMegan
3:     3 JillMegan  JackMatt
4:     4 MattMegan  JackJill
5:     5 JackMegan  JillMatt
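
The tables above hold roster strings rather than the numeric IDs the question asks for. As a rough sketch, one way to get from the long format to integer IDs is to number the Team strings by first appearance before casting; the long table is rebuilt by hand here so the example runs on its own, and the names long, wide, and TeamID are assumptions for illustration:

```r
library(data.table)

# Long-format result from the reshaping step, retyped by hand.
long <- data.table(
  Round = rep(1:5, each = 2),
  Event = rep(c("A", "B"), times = 5),
  Team  = c("JackJill", "MattMegan", "JackJill", "MattMegan", "JillMegan",
            "JackMatt", "MattMegan", "JackJill", "JackMegan", "JillMatt")
)

# Number each distinct roster in order of first appearance.
long[, TeamID := match(Team, unique(Team))]

# Cast back to one row per round, with the IDs as values.
wide <- dcast(long, Round ~ Event, value.var = "TeamID")
print(wide)
```

Passing value.var explicitly avoids relying on dcast guessing which column to spread.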