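I have a data frame with a column that I want to split on dashes, duplicating each row so that the copies contain the values to the left and right of the dash. I know how to split and duplicate, but I can't figure out how to keep the rest of the string. That's a terrible description, so I think it's easier to show the data frame and the desired output.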
tmp = structure(list(Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can",
"8 - 12 Pack 11.5 - 16 oz Bottle or Can"), Row.Count = c("899",
"305"), Test = c("B", "A")), .Names = c("Unit.Types", "Row.Count",
"Test"), row.names = c(104L, 196L), class = "data.frame")
library(tidyr)
library(dplyr)
tmp2 = tmp %>% mutate(Unit.Types = strsplit(as.character(Unit.Types), "-")) %>% unnest(Unit.Types)
tmp2
  Row.Count Test Unit.Types
1 899       B    10
2 899       B    12 Pack 11.2
3 899       B    14.9 oz Bottle or Can
4 305       A    8
5 305       A    12 Pack 11.5
6 305       A    16 oz Bottle or Can
My desired output should look like this:
  Unit.Types                    Row.Count Test
1 10 Pack 11.2 oz Bottle or Can 899       B
2 10 Pack 14.9 oz Bottle or Can 899       B
3 12 Pack 11.2 oz Bottle or Can 899       B
4 12 Pack 14.9 oz Bottle or Can 899       B
5 8 Pack 11.5 oz Bottle or Can  305       A
6 8 Pack 16 oz Bottle or Can    305       A
7 12 Pack 11.5 oz Bottle or Can 305       A
8 12 Pack 16 oz Bottle or Can   305       A
Or at least something like this, with only the range before " oz" split:
  Unit.Types                         Row.Count Test
1 10 - 12 Pack 11.2 oz Bottle or Can 899       B
2 10 - 12 Pack 14.9 oz Bottle or Can 899       B
3 8 - 12 Pack 11.5 oz Bottle or Can  305       A
4 8 - 12 Pack 16 oz Bottle or Can    305       A
Any help is much appreciated!!
Answer 0 (score: 1)
Take a look at this function:
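f <- function(x){
  # Cut on the fixed labels, leaving the two numeric ranges ("10 - 12", "11.2 - 14.9")
  strsplit(x, " Pack | oz Bottle or Can")[[1]] %>%
    strsplit(" - ") %>%   # split each range into its endpoints
    expand.grid() %>%     # every combination, in columns Var1 and Var2
    mutate(V = paste(Var1, "Pack", Var2, "oz Bottle or Can")) %>%
    `[[`("V")             # return the rebuilt strings as a vector
}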
It is meant to be applied to the strings in the Unit.Types column. For example:
> f(tmp$Unit.Types[[1]])
[1] "10 Pack 11.2 oz Bottle or Can" "12 Pack 11.2 oz Bottle or Can"
[3] "10 Pack 14.9 oz Bottle or Can" "12 Pack 14.9 oz Bottle or Can"
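To see why the results come out in that order, it may help to walk through the intermediate values for one string. This is only a sketch in base R; the variable names (`ranges`, `endpoints`, `combos`) and the column names `pack` and `oz` are chosen here for illustration and are not part of the function above.

```r
# Walk through the intermediate values for one input string.
x <- "10 - 12 Pack 11.2 - 14.9 oz Bottle or Can"

# Step 1: cut on the fixed labels, leaving only the two numeric ranges.
ranges <- strsplit(x, " Pack | oz Bottle or Can")[[1]]

# Step 2: split each range on " - " to get its endpoints.
endpoints <- strsplit(ranges, " - ")

# Step 3: expand.grid() builds every combination of the endpoints;
# the first column varies fastest.
combos <- expand.grid(endpoints, stringsAsFactors = FALSE)
names(combos) <- c("pack", "oz")  # illustrative names (the defaults are Var1, Var2)
print(combos)
```

Because the first column varies fastest, the pack sizes alternate (10, 12, 10, 12) while each oz value is held for two rows, which matches the order of the strings returned by f.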
Then, using this function, we can do the following:
ans <- tmp %>% split(1:nrow(tmp)) %>%
  lapply(function(x) data.frame(Unit.Types = f(x$Unit.Types),
                                Row.Count = x$Row.Count,
                                Test = x$Test
                                )
         ) %>%
  do.call(rbind, .)
row.names(ans) <- NULL
ans
This gives the data.frame we wanted.
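As an alternative to split/lapply/rbind, the same expansion can probably be done by repeating row indices. The following is a sketch under a few assumptions: `f` is restated in base R so the block runs without packages, `tmp` is rebuilt with default row names, and the names `expanded`, `idx` and `ans2` are made up for this example.

```r
tmp <- data.frame(
  Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can",
                 "8 - 12 Pack 11.5 - 16 oz Bottle or Can"),
  Row.Count = c("899", "305"),
  Test = c("B", "A"),
  stringsAsFactors = FALSE
)

# Base-R restatement of f(), so that no packages are needed here.
f <- function(x) {
  parts <- strsplit(strsplit(x, " Pack | oz Bottle or Can")[[1]], " - ")
  g <- expand.grid(parts, stringsAsFactors = FALSE)
  paste(g$Var1, "Pack", g$Var2, "oz Bottle or Can")
}

# One character vector of expanded strings per original row.
expanded <- lapply(tmp$Unit.Types, f)

# Repeat each row index once per expanded string, then overwrite the column.
idx <- rep(seq_len(nrow(tmp)), lengths(expanded))
ans2 <- tmp[idx, ]
ans2$Unit.Types <- unlist(expanded)
row.names(ans2) <- NULL
print(ans2)
```

Indexing with `tmp[idx, ]` carries every other column along automatically, so the columns do not have to be listed one by one.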
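UPD regarding your comment: we can use a regular expression that matches either a pair of numbers separated by ' - ' or a single number, and rewrite f with it.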
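regex <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"  # "\\." matches a literal decimal point
f <- function(x){
  m <- gregexpr(regex, x)
  matches <- regmatches(x, m)[[1]]                         # numbers or "a - b" ranges
  nonmatches <- regmatches(x, m, invert = TRUE)[[1]][-1]   # text following each match
  strsplit(matches, " - ") %>%
    expand.grid(stringsAsFactors = FALSE) %>%
    # interleave each combination with the text pieces and glue them back together
    apply(MARGIN = 1, function(y) rbind(y, nonmatches) %>%
            c %>%
            paste(collapse = ""))
}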
This function can even handle strings with three or more numeric specifications:
> x <- "2 - 3 big packs of 10 - 12 Pack 11.2 - 14.9 oz Can"
> f(x)
[1] "2 big packs of 10 Pack 11.2 oz Can" "3 big packs of 10 Pack 11.2 oz Can"
[3] "2 big packs of 12 Pack 11.2 oz Can" "3 big packs of 12 Pack 11.2 oz Can"
[5] "2 big packs of 10 Pack 14.9 oz Can" "3 big packs of 10 Pack 14.9 oz Can"
[7] "2 big packs of 12 Pack 14.9 oz Can" "3 big packs of 12 Pack 14.9 oz Can"
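For completeness, here is a sketch of how the regex-based version might be applied to the whole data frame. It is written in base R so that it runs on its own; the names `f_regex`, `expanded` and `res` are hypothetical, the dot in the pattern is escaped so it matches only a literal decimal point, and `tmp` is rebuilt with default row names.

```r
tmp <- data.frame(
  Unit.Types = c("10 - 12 Pack 11.2 - 14.9 oz Bottle or Can",
                 "8 - 12 Pack 11.5 - 16 oz Bottle or Can"),
  Row.Count = c("899", "305"),
  Test = c("B", "A"),
  stringsAsFactors = FALSE
)

# A number, optionally followed by " - " and a second number.
regex <- "[0-9]+(\\.[0-9]+)?( - [0-9]+(\\.[0-9]+)?)?"

f_regex <- function(x) {
  m <- gregexpr(regex, x)
  matches <- regmatches(x, m)[[1]]
  # Text after each match; [-1] drops whatever precedes the first number.
  nonmatches <- regmatches(x, m, invert = TRUE)[[1]][-1]
  grid <- expand.grid(strsplit(matches, " - "), stringsAsFactors = FALSE)
  # Interleave each combination of numbers with the text pieces.
  apply(grid, 1, function(y) paste(c(rbind(y, nonmatches)), collapse = ""))
}

expanded <- lapply(tmp$Unit.Types, f_regex)
res <- tmp[rep(seq_len(nrow(tmp)), lengths(expanded)), ]
res$Unit.Types <- unlist(expanded)
row.names(res) <- NULL
print(res)
```

Note that the `[-1]` discards any text before the first number, so this assumes every string starts with a number, as the ones in the question do.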