For a dummy data set:
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
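The data set now looks like this:
   teamid member leader
1:      1    a,b      c
2:      2           d,e
3:      3  c,g,h
There are 3 columns. For each team, the members and the leaders are stored in separate columns. A team may have members but no leader, or leaders but no members.
Here is my ALMOST desired output:
   teamid value leader
1:      1     a  FALSE
2:      1     b  FALSE
3:      1     c   TRUE
4:      1     c   TRUE
5:      2     d   TRUE
6:      2     e   TRUE
7:      3     c  FALSE
8:      3     g  FALSE
9:      3     h  FALSE
I want to combine the two columns into one and add a flag indicating whether each person is a team leader.
I have an ugly solution: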
dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)
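Questions:
I think this solution is inefficient, because it merges first and then melts. Are there any ideas for other approaches?
Rows 3 and 4 of the output are duplicates.
When I try to change the column name, I get an error:
setnames(dt6,value,member)
Error in setnames(dt6, value, member) : object 'value' not found
Perhaps the most important thing:
When I ran this on my real data set, which has more than 1 million rows and 3 columns, the following error occurred:
merge(df1, df2, all = TRUE)
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
  Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Any suggestions? Many thanks!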
Answer 0 (score: 2)
Melt first.
result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
# teamid status member
# 1: 1 member a
# 2: 1 member b
# 3: 1 leader c
# 4: 2 leader d
# 5: 2 leader e
# 6: 3 member c
# 7: 3 member g
# 8: 3 member h
If you want to drop the status column and add a "marker" to the member column instead, you can do this:
result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
# teamid member
# 1: 1 a
# 2: 1 b
# 3: 1 c*
# 4: 2 d*
# 5: 2 e*
# 6: 3 c
# 7: 3 g
# 8: 3 h
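If a logical `leader` column is preferred over a "*" suffix, which is closer to the output sketched in the question, the same melt-first idea can be adapted. This is only a sketch under the assumption that the input looks like the dummy `dt`; the name `long` is illustrative. Note that `setnames()` and similar functions expect column names as quoted strings, which is likely why the unquoted `setnames(dt6,value,member)` call in the question fails.

```r
library(data.table)

dt <- data.table(
  teamid = c(1, 2, 3),
  member = c("a,b", "", "c,g,h"),
  leader = c("c", "d,e", "")
)

# Melt to long form first, so members and leaders share one column
long <- melt(dt, id.vars = "teamid",
             variable.name = "status", value.name = "member")

# Drop empty strings, then split the comma-separated names per group
long <- long[nchar(member) > 0,
             .(member = unlist(strsplit(member, ",", fixed = TRUE))),
             by = .(teamid, status)]

# Logical flag instead of a "*" suffix, then drop the helper column
long[, leader := status == "leader"][, status := NULL]
setkey(long, teamid)
long
```

Because no merge is involved, each person appears once per role, so the duplicated leader row does not occur.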
Answer 1 (score: 0)
A slightly simpler approach might be
crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)
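# repeat each label once per person; use length(leads), the split leaders,
# not length(leader), the original unsplit column
dt_long <- data.table(people = c(crew, leads),
                      status = rep(c("crew", "leader"), c(length(crew), length(leads))))
which gives me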
people status
1: a crew
2: b crew
3: c crew
4: g crew
5: h crew
6: c leader
7: d leader
8: e leader
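Note that the result above no longer carries `teamid`. If the team should be kept, one possible variant is to split within each `teamid` group and stack the two pieces with `rbindlist()`. This is a sketch, not part of the original answer; the `idcol` label values "crew" and "leader" are taken from the list names.

```r
library(data.table)

dt <- data.table(
  teamid = c(1, 2, 3),
  member = c("a,b", "", "c,g,h"),
  leader = c("c", "d,e", "")
)

# Split per team; an empty string yields zero rows for that team
crew  <- dt[, .(people = unlist(strsplit(member, ",", fixed = TRUE))), by = teamid]
leads <- dt[, .(people = unlist(strsplit(leader, ",", fixed = TRUE))), by = teamid]

# Stack both tables; the list names become the values of the status column
dt_long <- rbindlist(list(crew = crew, leader = leads), idcol = "status")
dt_long
```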
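Answer 2 (score: 0)
You can now also try a tidyverse solution
# separate_rows() and gather() come from tidyr; the other verbs from dplyr
library(dplyr)
library(tidyr)
dt %>%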
separate_rows(member) %>%
separate_rows(leader) %>%
gather(status, member, -teamid) %>%
distinct() %>%
filter(member != "") %>%
mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>%
select(-status)
teamid member
1 1 a
2 1 b
3 3 c
4 3 g
5 3 h
6 1 c*
7 2 d*
8 2 e*