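I have a data frame with 5 columns. Four of them hold "normal" values, and the fifth holds a set of delimiter-separated key/value pairs in the format
key1 = value1 | key2 = value2 | key3 = value3
No key is repeated within a single row, but there is no guarantee that the keys appear in the same order in every row, or that every row has the same number of key/value pairs.
What is the easiest way to split these values into about 15 columns, each correctly named after its key?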
Answer 0: (score: 2)
Here is one option that uses tidyr, dplyr and stringr and keeps the non-key/value columns intact.
Create test data
# data_frame() is deprecated; tibble() is its current replacement
testdf <- tibble::tibble(
  col1 = paste0("a", 1:4),
  col2 = paste0("b", 1:4),
  col3 = paste0("c", 5:8),
  col4 = paste0("enst", 12099:12102),
  col5 = c("bang=1 | dong=2 | foo=6", "bang=2 | dong=3 | zups=foo",
           "dimbo=2 | bang=6 | jumbo=7 | zoo=camel", "boo=0")
)
Code
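library(stringr)
library(tidyr)
library(dplyr)

KVsep <- fixed(" | ")  # separator between key/value pairs
Vsep  <- fixed("=")    # separator between a key and its value

newDF <- testdf %>%
  mutate(KVpairs = str_split(col5, KVsep)) %>%                 # list of pairs per row
  unnest(KVpairs) %>%                                          # one row per pair
  separate(KVpairs, into = c("key", "value"), sep = Vsep) %>%  # pair -> key, value
  spread(key, value) %>%                                       # one column per key, NA where absent
  select(-col5)                                                # drop the packed column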
Result
> newDF
# A tibble: 4 x 12
  col1  col2  col3  col4      bang  boo   dimbo dong  foo   jumbo zoo   zups
* <chr> <chr> <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 a1    b1    c5    enst12099 1     NA    NA    2     6     NA    NA    NA
2 a2    b2    c6    enst12100 2     NA    NA    3     NA    NA    NA    foo
3 a3    b3    c7    enst12101 6     NA    2     NA    NA    7     camel NA
4 a4    b4    c8    enst12102 NA    0     NA    NA    NA    NA    NA    NA
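The format in the question puts spaces around "=" and may not space "|" consistently, whereas the test data above uses "=" without spaces. As a minimal sketch of how the same pipeline could tolerate optional whitespace, the separators can be written as regular expressions. The data frame `df`, its `id` column and the result name `wide` are made up for illustration and are not from the original post.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data in the "key = value | key = value" layout from the question
df <- tibble(
  id   = c("a1", "a2", "a3"),
  col5 = c("bang = 1 | dong = 2", "dong=3|zups = foo", "boo = 0")
)

wide <- df %>%
  # split on "|" together with any surrounding whitespace
  mutate(KVpairs = str_split(col5, regex("\\s*\\|\\s*"))) %>%
  unnest(cols = KVpairs) %>%
  # split on "=" together with any surrounding whitespace (sep is a regex here)
  separate(KVpairs, into = c("key", "value"), sep = "\\s*=\\s*") %>%
  # remove whitespace left at the very start or end of the original string
  mutate(key = str_trim(key), value = str_trim(value)) %>%
  spread(key, value) %>%
  select(-col5)
```

Trimming the keys matters more than trimming the values: an untrimmed key such as "dong " would become a separate column next to "dong".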
Answer 1: (score: 1)
Well, I don't think there is a dedicated function for this, but it is not complicated to program one yourself.
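c2nv <- function(x) {
  # Part 1: convert each string into a named character vector (names are the keys)
  pairs <- strsplit(x, "|", fixed = TRUE)        # split each row at "|"
  # lapply() rather than sapply(): sapply() would simplify to a matrix
  # whenever every row happens to have the same number of pairs
  x1 <- lapply(pairs, function(p) {
    kv <- strsplit(p, "=", fixed = TRUE)         # split each pair at "="
    keys <- trimws(vapply(kv, `[`, character(1), 1))
    vals <- trimws(vapply(kv, `[`, character(1), 2))
    setNames(vals, keys)
  })
  # Part 2: collect every key seen in any row, then look each one up per row
  u <- sort(unique(unlist(lapply(x1, names))))   # unique keys
  res <- do.call(rbind, lapply(x1, function(v) unname(v[u])))  # NA where a key is absent
  colnames(res) <- u
  res
}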
col <- c("bang=1 | dong = 2 | foo = 6", "bang1=2 | dong = 3 | zups = foo",
         "dimbo = 2 | bang = 6 | jumbo = 7 | zoo = camel", "boo = 0")
> c2nv(col)
     bang bang1 boo dimbo dong foo jumbo zoo     zups
[1,] "1"  NA    NA  NA    "2"  "6" NA    NA      NA
[2,] NA   "2"   NA  NA    "3"  NA  NA    NA      "foo"
[3,] "6"  NA    NA  "2"   NA   NA  "7"   "camel" NA
[4,] NA   NA    "0" NA    NA   NA  NA    NA      NA
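The function above returns a character matrix, while the question asks for columns in the data frame. A minimal sketch of attaching such a matrix to the other columns follows. The matrix `kv` is a hand-built stand-in for the function's return value, and `df` and `out` are hypothetical names; none of the values come from the original post.

```r
# Stand-in for the character matrix returned by the function above (made-up values)
kv <- matrix(c("1", NA, "2", "3"), nrow = 2,
             dimnames = list(NULL, c("bang", "dong")))

# Hypothetical "normal" columns of the original data frame
df <- data.frame(col1 = c("a1", "a2"), col2 = c("b1", "b2"),
                 stringsAsFactors = FALSE)

# Bind the key columns next to the existing ones; rows are matched by position
out <- cbind(df, as.data.frame(kv, stringsAsFactors = FALSE))

# Optional: convert columns that look numeric; as.is = TRUE keeps the rest as character
out[] <- lapply(out, type.convert, as.is = TRUE)
```

Because the rows are matched by position only, this assumes the matrix was built from the same data frame in the same row order.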