Manipulating two data frames based on strings of different lengths

Asked: 2016-07-06 12:50:29

Tags: r

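I asked a question here, "Finding the index based on two data frames of strings", and got a perfect answer. Now I am stuck on another problem that I cannot solve. If my second data set has several columns, I can handle it with: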
setDT(strs)[, c('colids1','colids2') := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]

This works as long as every column of my second data set (strs) has the same length. If the lengths differ, it fails with an error.

Say my first data set is

lut <- structure(list(V1 = c("O75663", "O95400", "O95433", NA, NA), 
    V2 = c("O95456", "O95670", NA, NA, NA), V3 = c("O75663", 
    "O95400", "O95433", "O95456", "O95670"), V4 = c("O95456", 
    "O95670", "O95801", "P00352", NA), V1 = c("O75663", "O95400", 
    "O95433", NA, NA), V2 = c("O95456", "O95670", NA, NA, NA), 
    V3 = c("O75663", "O95400", "O95433", "O95456", "O95670"), 
    V4 = c("O95456", "O95670", "O95801", "P00352", NA)), .Names = c("V1", 
"V2", "V3", "V4", "V1", "V2", "V3", "V4"), row.names = c(NA, 
-5L), class = "data.frame")

and my second data set is

strs <- structure(list(strings = structure(c(2L, 3L, 4L, 5L, 6L, 7L, 
1L, 1L), .Label = c("", "O75663", "O95400", "O95433", "O95456", 
"O95670", "O95801"), class = "factor"), strings2 = structure(c(4L, 
2L, 6L, 5L, 3L, 1L, 1L, 1L), .Label = c("", "O75663", "O95433", 
"O95456", "P00352", "P00492"), class = "factor"), strings3 = structure(c(4L, 
6L, 7L, 8L, 2L, 3L, 5L, 1L), .Label = c("", "O75663", "O95400", 
"O95456", "O95670", "O95801", "P00352", "P00492"), class = "factor"), 
    strings4 = structure(c(2L, 5L, 3L, 4L, 1L, 1L, 1L, 1L), .Label = c("", 
    "O95400", "O95456", "O95801", "P00492"), class = "factor"), 
    strings5 = structure(c(8L, 2L, 7L, 1L, 3L, 6L, 5L, 4L), .Label = c("O75663", 
    "O95400", "O95433", "O95456", "O95670", "O95801", "P00352", 
    "P00492"), class = "factor")), .Names = c("strings", "strings2", 
"strings3", "strings4", "strings5"), class = "data.frame", row.names = c(NA, 
-8L))

This is what I tried:

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]

It works when the columns of strs have the same length, but not when the lengths vary.
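To make the per-cell lookup in the expression above easier to follow, here is a minimal base-R sketch of what `toString(which(colSums(lut == x, na.rm=TRUE) > 0))` computes for a single character value. The table `lut_demo` and the helper `find_cols` are hypothetical names with made-up values, introduced only for illustration; they are not part of the question's data.

```r
# Toy lookup table (hypothetical values, not the question's data)
lut_demo <- data.frame(
  A = c("x1", "x2", NA),
  B = c("x2", "x3", NA),
  C = c("x1", NA, NA),
  stringsAsFactors = FALSE
)

# Return the indices of the columns of `lut` that contain the
# character value `x`, collapsed into one comma-separated string.
find_cols <- function(x, lut) {
  # lut == x is a logical matrix; NA cells are ignored by na.rm = TRUE
  hits <- colSums(lut == x, na.rm = TRUE) > 0
  toString(which(hits))
}

find_cols("x1", lut_demo)  # "1, 3"
find_cols("x2", lut_demo)  # "1, 2"
find_cols("zz", lut_demo)  # "" (no column contains the value)
```

Note that the sketch passes plain character values; the columns of strs in the question are factors.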

2 Answers:

Answer 0: (score: 1)

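I took this from @scentoni. rapply is a recursive version of lapply; here it is used to convert all factor columns to character. The mode argument of rapply is called how. With how = "replace", each element of the list that is not itself a list and whose class is included in classes is replaced by the result of applying as.character to that element.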

strs <- rapply(strs, as.character, classes="factor", how="replace")
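As a small illustration of this call, the sketch below applies the same rapply invocation to a toy data frame; `df_demo` is a hypothetical name with made-up values. Only the factor column is touched, because `classes = "factor"` restricts which elements `as.character` is applied to.

```r
# Toy data frame with one factor column and one integer column
# (hypothetical values for illustration only)
df_demo <- data.frame(
  id = factor(c("a", "b", "")),
  n  = c(1L, 2L, 3L)
)

# Replace every factor element with its character representation;
# elements of other classes are left unchanged.
converted <- rapply(df_demo, as.character, classes = "factor", how = "replace")

class(converted[[1]])  # "character"
class(converted[[2]])  # "integer"
converted[[1]]         # "a" "b" ""
```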

Then run:

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][]

Answer 1: (score: 1)

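Converting the factor variables in strs to character variables can also be done easily with data.table. Assuming your strs data set is already a data.table, you should do: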

strs[, names(strs) := lapply(.SD, as.character)]

If strs is not a data.table, you should use:

setDT(strs)[, names(strs) := lapply(.SD, as.character)]

After that, you can perform the operation as before. Chained together, it looks like this:

setDT(strs)[, lapply(.SD, as.character)
            ][, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), 
              by = 1:nrow(strs)][]
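For readers without data.table, the same two steps (convert factors to character, then look up every cell) can be sketched in base R. This is an alternative formulation, not the method used in the answers; `lut_demo`, `strs_demo`, and `lookup` are hypothetical names with made-up values.

```r
# Toy inputs (hypothetical values, not the question's data)
lut_demo <- data.frame(
  A = c("x1", "x2", NA),
  B = c("x2", "x3", NA),
  C = c("x1", NA, NA),
  stringsAsFactors = FALSE
)
strs_demo <- data.frame(
  s1 = factor(c("x1", "x3", "")),
  s2 = factor(c("x2", "", "zz"))
)

# Step 1: convert every factor column to character
strs_chr <- as.data.frame(lapply(strs_demo, as.character),
                          stringsAsFactors = FALSE)

# Step 2: for each cell, list the lut_demo columns containing its value
lookup <- function(x) {
  toString(which(colSums(lut_demo == x, na.rm = TRUE) > 0))
}
colids <- lapply(strs_chr, function(col) {
  vapply(col, lookup, character(1), USE.NAMES = FALSE)
})
names(colids) <- paste0("colids_", seq_along(colids))

# Bind the original strings and the lookup results side by side
result <- cbind(strs_chr, as.data.frame(colids, stringsAsFactors = FALSE))
result$colids_1  # "1, 3" "2" ""
result$colids_2  # "1, 2" "" ""
```

Empty strings and values absent from the lookup table both yield an empty result string, since no column matches them.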