I have a dataset in an odd report format that I need to turn into a workable data frame. The data I am working with looks like this:
ids<-(c("A101","","","","B101","","","C101","","",""))
dx<-c("Lung","","","","Kidney","","","Prostate","","","")
alt<-c("","A766","G283","F933","","B293","T432","","U920","D289","S203")
val<-c(NA,3.2,4.3,7.2,NA,2.1,3.8,NA,8.1,5.3,7.1)
df.in<-data.frame(ids,dx,alt,val)
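This produces a format in which the data values are not aligned with their sample IDs. I would like them aligned so that the final data frame looks like this: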
ids<-(c("A101","A101","A101","B101","B101","C101","C101","C101"))
dx<-c("Lung","Lung","Lung","Kidney","Kidney","Prostate","Prostate","Prostate")
alt<-c("A766","G283","F933","B293","T432","U920","D289","S203")
val<-c(3.2,4.3,7.2,2.1,3.8,8.1,5.3,7.1)
df.out<-data.frame(ids,dx,alt,val)
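I have explored different approaches using plyr and lapply, but none of them seems to produce the 'df.out' format above. Note that the number of values per sample is not uniform (i.e., some samples may have only 1 value, while others may have up to 10). Any ideas on how to handle this?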
Answer 0 (score: 1)
One way with tidyr and dplyr:
library(dplyr)
library(tidyr)
# Replace blank cells "" with NA
df.in[df.in == ""] <- NA
# Fill NA values with value of row above it
df.in %>%
  fill(c(ids, dx), .direction = "down") %>%
  drop_na() %>%
  mutate_if(is.factor, as.character) # optional
# A tibble: 8 x 4
ids dx alt val
<chr> <chr> <chr> <dbl>
1 A101 Lung A766 3.20
2 A101 Lung G283 4.30
3 A101 Lung F933 7.20
4 B101 Kidney B293 2.10
5 B101 Kidney T432 3.80
6 C101 Prostate U920 8.10
7 C101 Prostate D289 5.30
8 C101 Prostate S203 7.10
The last line in the chain, mutate_if(is.factor, as.character), is optional and converts factors to characters. We can avoid this step by using stringsAsFactors = FALSE when creating the dataset.
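As a minimal sketch of that last point, the snippet below rebuilds the question's data with stringsAsFactors = FALSE and runs the same fill and drop_na steps, so no factor conversion is needed. The name df.clean is only an illustrative choice, and it assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, but kept as character columns
df.in <- data.frame(
  ids = c("A101", "", "", "", "B101", "", "", "C101", "", "", ""),
  dx  = c("Lung", "", "", "", "Kidney", "", "", "Prostate", "", "", ""),
  alt = c("", "A766", "G283", "F933", "", "B293", "T432", "", "U920", "D289", "S203"),
  val = c(NA, 3.2, 4.3, 7.2, NA, 2.1, 3.8, NA, 8.1, 5.3, 7.1),
  stringsAsFactors = FALSE
)

# Replace blank cells "" with NA
df.in[df.in == ""] <- NA

# df.clean is a hypothetical name for the aligned result
df.clean <- df.in %>%
  fill(ids, dx, .direction = "down") %>%  # carry ids and dx down into the value rows
  drop_na()                               # drop the header rows, where alt and val are NA

str(df.clean)
```

Because the columns are already character vectors, the result can be compared with df.out directly (provided df.out was also built with stringsAsFactors = FALSE).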
Answer 1 (score: 0)
> indx <- rep(which(is.na(df.in$val)), rle(cumsum(is.na(df.in$val)))$lengths)
> na.omit(cbind(df.in[indx, c("ids", "dx")], df.in[, c("alt", "val")]))
ids dx alt val
1.1 A101 Lung A766 3.2
1.2 A101 Lung G283 4.3
1.3 A101 Lung F933 7.2
5.1 B101 Kidney B293 2.1
5.2 B101 Kidney T432 3.8
8.1 C101 Prostate U920 8.1
8.2 C101 Prostate D289 5.3
8.3 C101 Prostate S203 7.1
Breakdown:
> first <- which(is.na(df.in$val))    # row positions where each group starts, i.e. 1, 5 and 8
> groups <- cumsum(is.na(df.in$val))  # group number of every row
> groupsize <- rle(groups)$lengths    # size of each group, header row included
> newdf <- transform(df.in[rep(first, groupsize), ], alt = df.in$alt, val = df.in$val)  # repeat each header row, then restore alt and val
> newdf <- na.omit(newdf)             # remove the header rows, where val is NA
> row.names(newdf) <- NULL            # reset the row names
> newdf
ids dx alt val
1 A101 Lung A766 3.2
2 A101 Lung G283 4.3
3 A101 Lung F933 7.2
4 B101 Kidney B293 2.1
5 B101 Kidney T432 3.8
6 C101 Prostate U920 8.1
7 C101 Prostate D289 5.3
8 C101 Prostate S203 7.1
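Since the question notes that samples can have different numbers of values, the breakdown above can be wrapped in a small function and tried on uneven groups. This is only a sketch in base R: the function name align_report and the toy data are made up for illustration, and it assumes that the first row is a header row and that val is NA only on header rows.

```r
# Hypothetical helper (the name is not from the answer): repeat each header
# row's ids/dx over its group, then keep alt and val from the original rows.
# Assumes row 1 is a header row and val is NA only on header rows.
align_report <- function(df) {
  is_header <- is.na(df$val)                     # header rows carry ids/dx but no value
  first     <- which(is_header)                  # row position of each group's header
  groupsize <- rle(cumsum(is_header))$lengths    # rows per group, header included
  out <- df[rep(first, groupsize), c("ids", "dx")]
  out$alt <- df$alt
  out$val <- df$val
  out <- out[!is_header, ]                       # drop the header rows themselves
  row.names(out) <- NULL                         # reset the row names
  out
}

# Made-up data with uneven groups: X1 has one value, Y2 has three
report <- data.frame(
  ids = c("X1", "", "Y2", "", "", ""),
  dx  = c("Liver", "", "Skin", "", "", ""),
  alt = c("", "Q111", "", "R222", "S333", "T444"),
  val = c(NA, 1.5, NA, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

aligned <- align_report(report)
aligned
```

The group sizes come from the data itself through rle, so nothing in the function depends on how many values each sample has.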