I have the following dataset:
mydata <- data.frame(Factors = c("a,b", "c,d", "a,c"), Valu = c("2,3", "7,8", "9,1"))
Factors Valu
1 a,b 2,3
2 c,d 7,8
3 a,c 9,1
I would like to convert it into the following, with all of the values listed under each factor:
a b c d
2 2 7 7
3 3 8 8
9   9
1   1
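I need a pivot table, but I have to prepare the data first and then use melt and dcast to get the output I want. One of my failed attempts at preparing the data was: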
mydata2 <- cSplit(mydata, c("Factors","Valu") , ",", "long")
But then the values lose their association with the factors.
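To make the problem concrete, here is a minimal base R sketch of what pairing the split pieces by position produces. It does not call cSplit; it only assumes that splitting both columns at once lines the pieces up positionally, which matches the symptom described above. The name paired is made up for illustration.

```r
# Sample data from the question (kept as character, not factor)
mydata <- data.frame(Factors = c("a,b", "c,d", "a,c"),
                     Valu = c("2,3", "7,8", "9,1"),
                     stringsAsFactors = FALSE)

# Split both columns and line the pieces up by position:
# "a" gets only the first value of its row, "b" only the second, and so on
f <- unlist(strsplit(mydata$Factors, ","))
v <- unlist(strsplit(mydata$Valu, ","))

paired <- split(v, f)
paired$a  # "2" "9", whereas the desired result for a is 2 3 9 1
```

Each factor ends up with one value per row instead of every value in its row, so the splitting has to cross factors with values rather than pair them.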
Answer 0 (score: 6)
Here is an option with cSplit:
library(splitstackshape)
with(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"), split(Valu, Factors))
#$a
#[1] 2 3 9 1
#$b
#[1] 2 3
#$c
#[1] 7 8 9 1
#$d
#[1] 7 8
If we need a data.table/data.frame, use dcast to convert the 'long' format to 'wide'.
dcast(cSplit(cSplit(mydata, 1, ",", "long"), 2, ",", "long"),
rowid(Factors)~Factors, value.var="Valu")[, Factors := NULL][]
# a b c d
#1: 2 2 7 7
#2: 3 3 8 8
#3: 9 NA 9 NA
#4: 1 NA 1 NA
NOTE: splitstackshape loads data.table. Here, we used data.table_1.10.0. The dcast from data.table is also very fast.
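The rowid(Factors) term on the left-hand side of the dcast formula numbers the rows within each factor, which gives every value its own row in the wide result. A rough base R sketch of that numbering follows. It uses ave as a stand-in (not what data.table does internally), and the Factors vector is typed by hand to mimic the long table.

```r
# Factor column of the long table, in the order the nested splits produce it
Factors <- c("a", "a", "b", "b", "c", "c", "d", "d", "a", "a", "c", "c")

# Running count within each factor level: 1, 2, ... per group
idx <- ave(seq_along(Factors), Factors, FUN = seq_along)

data.frame(Factors, idx)
```

Here a and c run from 1 to 4 while b and d stop at 2, which is why the wide table has four rows with NA in the b and d columns.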
Answer 1 (score: 4)
Using a couple of *apply functions, strsplit and grep:
## convert columns to characters so you can use strsplit
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
## get all the unique factor values by splitting them
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time)
l <- sapply(f, function(x) mydata[grep(x, mydata$Factors), "Valu"])
This gives a list in which each element is named after a 'Factors' value and contains all of the 'Valu' values associated with it:
l
# $a
# [1] "2,3" "9,1"
#
# $b
# [1] "2,3"
#
# $c
# [1] "7,8" "9,1"
#
# $d
# [1] "7,8"
Another lapply over this list splits the 'Valu's:
result <- lapply(l, function(x) unlist(strsplit(x, split = ",")))
result
# $a
# [1] "2" "3" "9" "1"
#
# $b
# [1] "2" "3"
#
# $c
# [1] "7" "8" "9" "1"
#
# $d
# [1] "7" "8"
Edit
To get the result in a data.frame, you can make every list element the same length (by padding with NA) and then call data.frame on the result:
## the number of rows required for each column
maxLength <- max(sapply(result, length))
## append 'NA's to list elements that have fewer than maxLength elements
result <- data.frame(sapply(result, function(x) c(x, rep(NA, maxLength - length(x))) ))
result
# a b c d
# 1 2 2 7 7
# 2 3 3 8 8
# 3 9 <NA> 9 <NA>
# 4 1 <NA> 1 <NA>
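The padded columns above are still character strings (hence the <NA> in the printout). As a hedged alternative sketch, the padding can also be done with the `length<-` replacement function, followed by a conversion to integer. The list is retyped here so the snippet runs on its own, and the names padded and df are made up for illustration.

```r
# The split values from the previous step, retyped so this runs standalone
result <- list(a = c("2", "3", "9", "1"), b = c("2", "3"),
               c = c("7", "8", "9", "1"), d = c("7", "8"))

n <- max(lengths(result))

# `length<-` extends a vector to length n, filling the tail with NA
padded <- lapply(result, `length<-`, n)

df <- as.data.frame(padded, stringsAsFactors = FALSE)
df[] <- lapply(df, as.integer)  # assumes every value is a whole number
df
```

The as.integer step only makes sense when all values are numeric, as in the original sample data.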
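Edit
In response to the comment: if you have "similar" strings (one factor name contained in another), make the grep regular expression explicit so that it only matches whole comma-separated entries. Plain ( ) grouping alone would still match substrings, so the groups below also anchor on the start, the end, or a comma (see any regex cheatsheet for an explanation).
mydata <- data.frame(Factors = c("a,b", "c,d", "a,c", "bo,ao"), Valu = c("2,3", "7,8", "9,1", "x,y"))
mydata$Factors <- as.character(mydata$Factors)
mydata$Valu <- as.character(mydata$Valu)
f <- unique(unlist(strsplit(unique(mydata$Factors), split = ",")))
## filter 'mydata' by using 'grep' to search for each individual factor value
## (using sapply for one at a time); the pattern requires the value to be
## delimited by the string boundaries or a comma, so "a" does not match "ao"
## (assumes the factor values contain no regex metacharacters)
l <- sapply(f, function(x) mydata[grep(paste0("(^|,)", x, "(,|$)"), mydata$Factors), "Valu"])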
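A regex-free way to match whole entries is to split each Factors string and test membership with %in%. This swaps grep for a different technique, so treat it as a sketch rather than part of the answer above; the names parts and hit are made up for illustration.

```r
mydata <- data.frame(Factors = c("a,b", "c,d", "a,c", "bo,ao"),
                     Valu = c("2,3", "7,8", "9,1", "x,y"),
                     stringsAsFactors = FALSE)

# One character vector of factor names per row
parts <- strsplit(mydata$Factors, ",", fixed = TRUE)
f <- unique(unlist(parts))

# For each factor, keep the rows whose split entries contain it exactly
l <- lapply(setNames(f, f), function(x) {
  hit <- vapply(parts, function(p) x %in% p, logical(1))
  mydata$Valu[hit]
})

l$a   # "2,3" "9,1"; the "bo,ao" row is not picked up
l$ao  # "x,y"
```

Because %in% compares whole strings, no escaping of special characters is needed.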
Answer 2 (score: 4)
Another base R attempt:
# character conversion first
mydata[] <- lapply(mydata, as.character)
long <- do.call(rbind,
do.call(Map, c(expand.grid, lapply(mydata, strsplit, ","), stringsAsFactors=FALSE))
)
split(long$Valu, long$Factors)
#$a
#[1] "2" "3" "9" "1"
#
#$b
#[1] "2" "3"
#
#$c
#[1] "7" "8" "9" "1"
#
#$d
#[1] "7" "8"
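The key step above is that expand.grid is applied to each row's split pieces, so every factor in a row is crossed with every value in that row. A minimal sketch for the first row only ("a,b" and "2,3"); the name pairs is made up for illustration:

```r
# Cross the two factors of row 1 with the two values of row 1
pairs <- expand.grid(Factors = c("a", "b"), Valu = c("2", "3"),
                     stringsAsFactors = FALSE)
pairs
#   Factors Valu
# 1       a    2
# 2       b    2
# 3       a    3
# 4       b    3
```

Map repeats this for every row, and rbind stacks the per-row results into the long table that split then groups by factor.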
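Answer 3 (score: 4)
I misunderstood in my comment above; if you want every Factor matched with every Valu, you need to separate the columns one at a time to get the combinations. If you add an index to spread on, it's not too bad: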
library(tidyverse)
mydata %>%
separate_rows(Factors) %>% separate_rows(Valu, convert = TRUE) %>%
# add indices to give row order when spreading
group_by(Factors) %>% mutate(i = row_number()) %>%
spread(Factors, Valu) %>%
select(-i) # clean up extra column
## # A tibble: 4 × 4
## a b c d
## * <int> <int> <int> <int>
## 1 2 2 7 7
## 2 3 3 8 8
## 3 9 NA 9 NA
## 4 1 NA 1 NA
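In more recent tidyr releases spread has been superseded by pivot_wider. A hedged sketch of the same pipeline with that function, assuming tidyr 1.0.0 or later and dplyr are installed; the name wide is made up for illustration:

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(Factors = c("a,b", "c,d", "a,c"),
                     Valu = c("2,3", "7,8", "9,1"),
                     stringsAsFactors = FALSE)

wide <- mydata %>%
  separate_rows(Factors, sep = ",") %>%
  separate_rows(Valu, sep = ",", convert = TRUE) %>%
  # index within each factor so that every value gets its own row
  group_by(Factors) %>%
  mutate(i = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Factors, values_from = Valu) %>%
  select(-i)  # drop the helper index

wide
```

The helper index i plays the same role as in the spread version: without it, several values would compete for the same cell.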