I have this data frame:
structure(list(rule.id = c(1, 2), rules = structure(1:2, .Label = c("Lamp1.1,Lamp1.2",
"Lamp2.1,Lamp2.2"), class = "factor")), .Names = c("rule.id",
"rules"), row.names = c(NA, -2L), class = "data.frame")
# rule.id rules
#1 1 Lamp1.1,Lamp1.2
#2 2 Lamp2.1,Lamp2.2
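I need to split the "rules" column on the comma delimiter (","). A row can contain any number of commas (not just the two items per row shown in the example). The result should be in normalized (long) format and keep the matching rule.id value from the original df. It should look like this: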
structure(list(rule.id = c(1, 1, 2, 2), lhs = c("Lamp1.1", "Lamp1.2",
"Lamp2.1", "Lamp2.2")), .Names = c("rule.id", "lhs"), row.names = c(NA,
-4L), class = "data.frame")
# rule.id lhs
#1 1 Lamp1.1
#2 1 Lamp1.2
#3 2 Lamp2.1
#4 2 Lamp2.2
I have code that handles the string split and the normalized (long) format, but I don't know how to handle the rule.id requirement:
lhs.norm <- as.data.frame(
cbind(
rules.df$ruleid,
unlist(strsplit(
unlist(lapply(strsplit(unlist(lapply(as.character(rules.df$rules),function(x) substr(x,2,nchar(x)))), "} =>", fixed = T), function(x) x[1]))
,","))))
Thanks to @acrun's solution using cSplit(rules.df.lhs, "lhs", ",", "long").
I benchmarked it at 19 seconds for 1M input rows (the result has about 2M rows).
Answer 0 (score: 1)
We can use cSplit from the splitstackshape package:
library(splitstackshape)
cSplit(df, "rules", ",", "long")
# rule.id rules
#1: 1 Lamp1.1
#2: 1 Lamp1.2
#3: 2 Lamp2.1
#4: 2 Lamp2.2
If this is a huge dataset, we can use stringi to do the split and repeat each rule.id once per piece:
library(stringi)
lst <- stri_split_fixed(df$rules, ",")
df2 <- data.frame(rule.id = rep(df$rule.id, lengths(lst)),
rules = unlist(lst))
df2
# rule.id rules
#1 1 Lamp1.1
#2 1 Lamp1.2
#3 2 Lamp2.1
#4 2 Lamp2.2
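The same rep()/lengths() pattern should also work with base R's strsplit when stringi is not available. Below is a minimal sketch under a few assumptions: the three-row input is made up to show rows with different numbers of items, and the output column is named lhs to match the format requested in the question.

```r
# Illustrative input (not from the question): rows with 2, 1 and 3 items
df <- data.frame(rule.id = c(1, 2, 3),
                 rules = c("Lamp1.1,Lamp1.2", "Lamp2.1", "Lamp3.1,Lamp3.2,Lamp3.3"),
                 stringsAsFactors = FALSE)

# Split on literal commas; as.character() guards against a factor column
parts <- strsplit(as.character(df$rules), ",", fixed = TRUE)

# Repeat each rule.id once per piece, then flatten the pieces into one vector
long <- data.frame(rule.id = rep(df$rule.id, lengths(parts)),
                   lhs = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

Because rep() takes its counts from lengths(parts), each rule.id stays aligned with its pieces no matter how many commas a row contains.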
Another option is data.table, splitting within each rule.id group:
library(data.table)
setDT(df)[, strsplit(as.character(rules), ","), by = rule.id]
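The data.table call above returns the split values in a column with the default name V1. A possible variant that names the column lhs, as in the requested output, is sketched below. It assumes the same two-row df as in the question, and it uses as.data.table() rather than setDT() so that the original data frame is not converted in place.

```r
library(data.table)

# Same example data as in the question
df <- data.frame(rule.id = c(1, 2),
                 rules = c("Lamp1.1,Lamp1.2", "Lamp2.1,Lamp2.2"))

# as.data.table() makes a copy, leaving df itself a plain data.frame
dt <- as.data.table(df)

# Naming the element inside .() sets the output column name;
# unlist() flattens the list returned by strsplit() within each group
res <- dt[, .(lhs = unlist(strsplit(as.character(rules), ",", fixed = TRUE))),
          by = rule.id]
res
```

The unlist() call also keeps the result well-formed if a rule.id appears on more than one input row, since all pieces for that group are flattened into a single vector.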