Collapsing rows from 2 different columns in a data.table?

Time: 2014-06-14 02:47:15

Tags: r data.table

For a dummy dataset

require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)

The dataset now looks like this:

   teamid member leader
1:      1    a,b      c
2:      2           d,e
3:      3  c,g,h  

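There are 3 columns. For each team, the team members and the team leaders are stored in separate columns. A team may have only members and no leader, or vice versa.

Here is my ALMOST desired output: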

   teamid value leader
1:      1     a  FALSE
2:      1     b  FALSE
3:      1     c   TRUE
4:      1     c   TRUE
5:      2     d   TRUE
6:      2     e   TRUE
7:      3     c  FALSE
8:      3     g  FALSE
9:      3     h  FALSE

I want to combine the two columns into one and add a flag indicating whether a person is a team leader.

I have an ugly solution,

dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)

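Questions:

  1. I don't think this solution is efficient: it merges first and then melts. Any ideas for other approaches?

  2. Rows 3 and 4 are duplicates.

  3. When I try to change the column name, I get an error: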

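    setnames(dt6,value,member)
    Error in setnames(dt6,value,member) : object 'value' not found

  4. Perhaps most importantly, when I test this on my real dataset, which has more than 1 million rows and 3 columns, the following error occurs:

    merge(df1,df2,all = TRUE)
    Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

Any suggestions? Thanks a lot!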

3 Answers:

Answer 0: (score: 2)

Melt first.

result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
#    teamid status member
# 1:      1 member      a
# 2:      1 member      b
# 3:      1 leader      c
# 4:      2 leader      d
# 5:      2 leader      e
# 6:      3 member      c
# 7:      3 member      g
# 8:      3 member      h

If you want to drop the status column and add a "flag" to the member column instead, you can do this:

result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
#    teamid member
# 1:      1      a
# 2:      1      b
# 3:      1     c*
# 4:      2     d*
# 5:      2     e*
# 6:      3      c
# 7:      3      g
# 8:      3      h
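
For the layout asked for in the question (a logical leader column rather than a status column or a "*" suffix), the same melt-first idea can be sketched as below. This is a minimal sketch, not part of the original answer: the names long and res and the intermediate column people are illustrative, and it assumes a data.table version that ships its own melt method, so reshape2 is not loaded. The last line also shows the quoting that setnames expects, which is what the "object 'value' not found" error in the question points to.

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(teamid = c(1, 2, 3),
                 member = c("a,b", "", "c,g,h"),
                 leader = c("c", "d,e", ""))

# Melt first, so both columns are split in one pass and no merge is needed
long <- melt(dt, id.vars = "teamid", measure.vars = c("member", "leader"),
             variable.name = "status", value.name = "people")

# Drop empty cells, then split the comma-separated strings within each group
res <- long[nchar(people) > 0,
            .(value = unlist(strsplit(people, ","))),
            by = .(teamid, status)]

# Turn the status column into a logical flag and sort by team
res[, leader := status == "leader"][, status := NULL]
setkey(res, teamid)

# Column names must be passed to setnames as character strings
setnames(res, "value", "member")
res
```

Because nothing is merged, each person appears once per role, so the duplicated leader row from the question does not occur and no cartesian join is involved.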

Answer 1: (score: 0)

A slightly simpler approach might be

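crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)    # flatten the list column into a character vector
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)

# Stack both vectors and label each entry by the column it came from
dt_long <- data.table(people=c(crew, leads),
    status = rep(c("crew", "leader"), c(length(crew), length(leads))))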

which gives me

   people status
1:      a   crew
2:      b   crew
3:      c   crew
4:      g   crew
5:      h   crew
6:      c leader
7:      d leader
8:      e leader
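
The result above no longer carries teamid, which the question's desired output includes. One possible way to keep it, sketched here under the assumption that base R's lengths() is available (R 3.2.0 or later), is to repeat each teamid once per split element. The names crew_list and leads_list are illustrative and not from the original answer.

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(teamid = c(1, 2, 3),
                 member = c("a,b", "", "c,g,h"),
                 leader = c("c", "d,e", ""))

# One character vector per row; an empty string splits to character(0)
crew_list  <- strsplit(dt$member, ",")
leads_list <- strsplit(dt$leader, ",")

dt_long <- data.table(
  # Repeat each teamid as many times as its row produced elements
  teamid = c(rep(dt$teamid, lengths(crew_list)),
             rep(dt$teamid, lengths(leads_list))),
  people = c(unlist(crew_list), unlist(leads_list)),
  status = rep(c("crew", "leader"),
               c(sum(lengths(crew_list)), sum(lengths(leads_list))))
)
dt_long
```

Rows with an empty member or leader cell contribute zero elements, so they drop out without an explicit filter.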

Answer 2: (score: 0)

You can now also try a tidyverse solution

library(tidyr)   # separate_rows(), gather()
library(dplyr)   # %>%, distinct(), filter(), mutate(), select()

dt %>% 
  separate_rows(member) %>% 
  separate_rows(leader) %>% 
  gather(status, member, -teamid) %>% 
  distinct() %>% 
  filter(member != "") %>% 
  mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>% 
  select(-status)

  teamid member
1      1      a
2      1      b
3      3      c
4      3      g
5      3      h
6      1     c*
7      2     d*
8      2     e*