R strsplit with normalized (long) format using a running index

Time: 2016-12-17 15:58:39

Tags: r strsplit normalize

I have this data frame:

structure(list(rule.id = c(1, 2), rules = structure(1:2, .Label = c("Lamp1.1,Lamp1.2", 
"Lamp2.1,Lamp2.2"), class = "factor")), .Names = c("rule.id", 
"rules"), row.names = c(NA, -2L), class = "data.frame")

#  rule.id           rules
#1       1 Lamp1.1,Lamp1.2
#2       2 Lamp2.1,Lamp2.2

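I need to split the "rules" column on the comma delimiter (","). A value may contain any number of commas, not just the two items shown in the example. The result should then be converted to normalized (long) format, keeping the corresponding rule.id value from the original df. The result should look like this: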

structure(list(rule.id = c(1, 1, 2, 2), lhs = c("Lamp1.1", "Lamp1.2", 
"Lamp2.1", "Lamp2.2")), .Names = c("rule.id", "lhs"), row.names = c(NA, 
-4L), class = "data.frame")

#  rule.id     lhs
#1       1 Lamp1.1
#2       1 Lamp1.2
#3       2 Lamp2.1
#4       2 Lamp2.2

I have code that handles the string split and the normalized (long) format, but I don't know how to handle the rule.id requirement:

lhs.norm <- as.data.frame(
  cbind(
    rules.df$ruleid, 
    unlist(strsplit(
      unlist(lapply(strsplit(unlist(lapply(as.character(rules.df$rules),function(x) substr(x,2,nchar(x)))), "} =>", fixed = T), function(x) x[1]))
      ,","))))

Thanks. I used @acrun's solution:

cSplit(rules.df.lhs, "lhs", ",", "long")

I benchmarked it at 19 seconds for 1M rows (the result has about 2M rows).

1 Answer:

Answer 0: (score: 1)

We can use cSplit from splitstackshape:

library(splitstackshape)
cSplit(df, "rules", ",", "long")
#   rule.id   rules
#1:       1 Lamp1.1
#2:       1 Lamp1.2
#3:       2 Lamp2.1
#4:       2 Lamp2.2

If this is a huge dataset, we can use stringi to do the split:

library(stringi)
lst <- stri_split_fixed(df$rules, ",")
df2 <- data.frame(rule.id = rep(df$rule.id, lengths(lst)),
                  rules = unlist(lst))
df2
#   rule.id   rules
#1       1 Lamp1.1
#2       1 Lamp1.2
#3       2 Lamp2.1
#4       2 Lamp2.2
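
For comparison, the same idea can be sketched in base R with no extra packages. This is only a minimal sketch, assuming the rules column holds comma-separated strings. The names parts and long are illustrative, and the output column is called lhs to match the desired result in the question.

```r
# Sample data from the question (rules kept as character for simplicity)
df <- data.frame(rule.id = c(1, 2),
                 rules = c("Lamp1.1,Lamp1.2", "Lamp2.1,Lamp2.2"),
                 stringsAsFactors = FALSE)

# Split each rule string on literal commas; the result is a list with
# one character vector per input row
parts <- strsplit(as.character(df$rules), ",", fixed = TRUE)

# Repeat each rule.id once per piece so ids stay aligned with the pieces
long <- data.frame(rule.id = rep(df$rule.id, lengths(parts)),
                   lhs = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

The key step is rep(df$rule.id, lengths(parts)): it builds the running index by repeating each id as many times as its row produced pieces, so rows with any number of commas are handled the same way.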

Another option is data.table:

library(data.table)
setDT(df)[, strsplit(as.character(rules), ","), by = rule.id]
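
The data.table call above returns the split values in a column with the default name V1. A possible variant, sketched here under the assumption that the output column should be called lhs as in the question, names the column explicitly and flattens the pieces with unlist(). The names dt and long are illustrative.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(rule.id = c(1, 2),
                 rules = c("Lamp1.1,Lamp1.2", "Lamp2.1,Lamp2.2"))

# .(lhs = ...) names the result column instead of the default V1;
# unlist() flattens the pieces even if a rule.id occurs on several rows
long <- dt[, .(lhs = unlist(strsplit(rules, ",", fixed = TRUE))),
           by = rule.id]
long
```

Grouping by rule.id lets data.table repeat the id for every piece automatically, so no explicit rep() is needed.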