R: Split and duplicate a row

Asked: 2016-05-03 19:39:37

Tags: r strsplit

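My data frame has a column that I want to split on the dash, duplicating the rows so that they contain the values to the left and to the right of the dash. I know how to split and duplicate, but I can't figure out how to keep the rest of the string. That is a terrible description, so I think it is easier to show the data frame and the desired output.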

tmp = structure(list(Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can", 
"8 - 12 Pack 11.5 - 16 oz Bottle or Can"), Row.Count = c("899", 
"305"), Test = c("B", "A")), .Names = c("Unit.Types", "Row.Count", 
"Test"), row.names = c(104L, 196L), class = "data.frame") 

library(tidyr)
library(dplyr)

tmp2 = tmp %>% mutate(Unit.Types = strsplit(as.character(Unit.Types), "-")) %>% unnest(Unit.Types)
tmp2

  Row.Count Test             Unit.Types
1       899    B                    10 
2       899    B          12 Pack 11.2 
3       899    B  14.9 oz Bottle or Can
4       305    A                     8 
5       305    A          12 Pack 11.5 
6       305    A    16 oz Bottle or Can

My desired output should look like this:

                     Unit.Types Row.Count Test
1 10 Pack 11.2 oz Bottle or Can       899    B
2 10 Pack 14.9 oz Bottle or Can       899    B
3 12 Pack 11.2 oz Bottle or Can       899    B
4 12 Pack 14.9 oz Bottle or Can       899    B
5  8 Pack 11.5 oz Bottle or Can       305    A
6    8 Pack 16 oz Bottle or Can       305    A
7 12 Pack 11.5 oz Bottle or Can       305    A
8   12 Pack 16 oz Bottle or Can       305    A

Or at least like this, with only the " oz" range split:

                          Unit.Types Row.Count Test
1 10 - 12 Pack 11.2 oz Bottle or Can       899    B
2 10 - 12 Pack 14.9 oz Bottle or Can       899    B
3  8 - 12 Pack 11.5 oz Bottle or Can       305    A
4    8 - 12 Pack 16 oz Bottle or Can       305    A

Any help is much appreciated!!

1 answer:

Answer 0 (score: 1)

Take a look at this function:

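f <- function(x){
    # split off the two numeric ranges, e.g. "10 - 12" and "11.2 - 14.9"
    strsplit(x, " Pack | oz Bottle or Can")[[1]] %>%
    # split each range into its endpoints
    strsplit(" - ") %>%
    # build every combination of endpoints (columns Var1 and Var2)
    expand.grid() %>%
    mutate(V = paste(Var1, "Pack", Var2, "oz Bottle or Can")) %>%
    `[[`("V")
}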

It is meant to be applied to a single string from the Unit.Types column. For example:

> f(tmp$Unit.Types[[1]])
[1] "10 Pack 11.2 oz Bottle or Can" "12 Pack 11.2 oz Bottle or Can"
[3] "10 Pack 14.9 oz Bottle or Can" "12 Pack 14.9 oz Bottle or Can"
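The combinations come from expand.grid(), which builds every pairing of the values it is given and varies the first column fastest. A minimal base R sketch of that step on its own, with the two value lists written out by hand for illustration:

```r
# Each list element holds the alternatives for one numeric slot.
parts <- list(c("10", "12"), c("11.2", "14.9"))

# On an unnamed list, expand.grid() names the columns Var1, Var2, ...
# and cycles through the first column fastest.
combos <- expand.grid(parts, stringsAsFactors = FALSE)

# Rebuild the full description for every combination.
labels <- paste(combos$Var1, "Pack", combos$Var2, "oz Bottle or Can")
print(labels)
```

This is why the result of f lists "10 Pack 11.2 ..." and "12 Pack 11.2 ..." before the 14.9 oz variants.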

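Then, using this function, we can do the following:

# Expand each row separately, then stack the pieces back together.
ans <- tmp %>% split(1:nrow(tmp)) %>%
    lapply(function(x) data.frame(Unit.Types = f(x$Unit.Types),
                                  Row.Count = x$Row.Count,
                                  Test = x$Test)) %>%
    do.call(rbind, .)
row.names(ans) <- NULL  # drop the "1.1", "1.2", ... names left by rbind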

ans is the data.frame we want.
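As an alternative to split/lapply/rbind, the same result can probably be reached by expanding every string first and then repeating each original row as many times as it has combinations. The sketch below is base R only; expand_unit_type is a hypothetical stand-in for f, written without dplyr so that the block runs on its own, and ans2 is likewise just an illustrative name.

```r
tmp <- data.frame(
  Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can",
                 "8 - 12 Pack 11.5 - 16 oz Bottle or Can"),
  Row.Count = c("899", "305"),
  Test = c("B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical base R stand-in for f(): same splitting logic, no pipes.
expand_unit_type <- function(x) {
  ranges <- strsplit(x, " Pack | oz Bottle or Can")[[1]]
  grid <- expand.grid(strsplit(ranges, " - "), stringsAsFactors = FALSE)
  paste(grid$Var1, "Pack", grid$Var2, "oz Bottle or Can")
}

# One character vector of combinations per original row.
expanded <- lapply(tmp$Unit.Types, expand_unit_type)

# Repeat each row index once per combination it produced.
idx <- rep(seq_len(nrow(tmp)), lengths(expanded))

ans2 <- data.frame(
  Unit.Types = unlist(expanded),
  Row.Count = tmp$Row.Count[idx],
  Test = tmp$Test[idx],
  stringsAsFactors = FALSE
)
print(ans2)
```

Because the rows are indexed rather than rebuilt one by one, the row names come out as 1 to 8 without a separate reset step.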

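UPD, regarding your comment: we can use a regular expression that matches either a pair of numbers separated by ' - ' or a single number, and rewrite f with it: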

# "\\." matches a literal decimal point; a bare "." would match any character
regex <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"

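f <- function(x){
    m <- gregexpr(regex, x)
    # numeric pieces, e.g. "10 - 12" and "11.2 - 14.9"
    matches <- regmatches(x, m)[[1]]
    # text after each numeric piece; [-1] drops the text before the first match
    nonmatches <- regmatches(x, m, invert = TRUE)[[1]][-1]
    strsplit(matches, " - ") %>%
    expand.grid(stringsAsFactors = FALSE) %>%
    # interleave each combination with the text pieces and glue them together
    apply(MARGIN = 1, function(y) rbind(y, nonmatches) %>%
                                  c %>%
                                  paste(collapse = ""))
}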
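One detail worth checking in a pattern like this is the decimal point: in a regular expression an unescaped . matches any character, while \\. (as written in an R string) matches only a literal dot. A small sketch of the difference, using a made-up input string and the illustrative names loose and strict:

```r
loose  <- "[0-9]+(.[0-9]+)?"    # "." can match any single character
strict <- "[0-9]+(\\.[0-9]+)?"  # "\\." matches only a decimal point

x <- "10x12 Pack"  # made-up input: two numbers joined by a letter

loose_hits  <- regmatches(x, gregexpr(loose, x))[[1]]
strict_hits <- regmatches(x, gregexpr(strict, x))[[1]]

print(loose_hits)   # the letter is swallowed: one match
print(strict_hits)  # two separate numbers
```

On the strings from the question both patterns behave the same; the difference only shows up when two numbers are separated by a single non-digit character.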

This function can even handle strings with three or more numeric specifications:

> x <- "2 - 3 big packs of 10 - 12 Pack 11.2 - 14.9 oz Can"
> f(x)
[1] "2 big packs of 10 Pack 11.2 oz Can" "3 big packs of 10 Pack 11.2 oz Can"
[3] "2 big packs of 12 Pack 11.2 oz Can" "3 big packs of 12 Pack 11.2 oz Can"
[5] "2 big packs of 10 Pack 14.9 oz Can" "3 big packs of 10 Pack 14.9 oz Can"
[7] "2 big packs of 12 Pack 14.9 oz Can" "3 big packs of 12 Pack 14.9 oz Can"
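Note that the regex-based f discards any text that comes before the first number, because of the [-1] on the inverted matches. If such a prefix can occur in the data, one possible variation keeps every text piece and interleaves the numbers between them. This is only a sketch: number_range and expand_ranges are hypothetical names, and it assumes the string contains at least one number.

```r
# Same idea as the pattern above, with the decimal point escaped.
number_range <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"

expand_ranges <- function(x) {
  m <- gregexpr(number_range, x)
  matches <- regmatches(x, m)[[1]]
  # invert = TRUE yields one more piece than there are matches:
  # the leading text, the text between matches, and the trailing text.
  between <- regmatches(x, m, invert = TRUE)[[1]]
  grid <- expand.grid(strsplit(matches, " - "), stringsAsFactors = FALSE)
  # Pad the numbers with "" so they line up with the text pieces,
  # then interleave text, number, text, number, ..., trailing text.
  apply(grid, 1, function(y) paste(c(rbind(between, c(y, ""))), collapse = ""))
}

res <- expand_ranges("Case of 10 - 12 Pack 11.2 oz Can")
print(res)
```

For this input the two results should be "Case of 10 Pack 11.2 oz Can" and "Case of 12 Pack 11.2 oz Can", with the "Case of " prefix preserved.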