Reformatting a data frame with misaligned data and gaps

Time: 2018-02-19 23:52:21

Tags: r

I have a dataset in an odd report format that I need to turn into a workable data frame. The data I am working with looks like this:

ids<-(c("A101","","","","B101","","","C101","","",""))
dx<-c("Lung","","","","Kidney","","","Prostate","","","")
alt<-c("","A766","G283","F933","","B293","T432","","U920","D289","S203")
val<-c(NA,3.2,4.3,7.2,NA,2.1,3.8,NA,8.1,5.3,7.1)
df.in<-data.frame(ids,dx,alt,val)

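This produces a format in which each series of values is not aligned with its sample ID. I would like them aligned so that the final data frame looks like this: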

ids<-(c("A101","A101","A101","B101","B101","C101","C101","C101"))
dx<-c("Lung","Lung","Lung","Kidney","Kidney","Prostate","Prostate","Prostate")
alt<-c("A766","G283","F933","B293","T432","U920","D289","S203")
val<-c(3.2,4.3,7.2,2.1,3.8,8.1,5.3,7.1)
df.out<-data.frame(ids,dx,alt,val)

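I have explored different approaches using plyr and lapply, but none of them seems to produce the 'df.out' format above. Note that there is no symmetry in the number of values a sample may have (i.e., some samples may have only 1 value, while others may have up to 10). Any ideas on how to handle this?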

2 answers:

Answer 0: (score: 1)

One way with tidyr and dplyr:

library(dplyr)
library(tidyr)

# Replace blank cells "" with NA
df.in[df.in == ""] <- NA

# Fill NA values with value of row above it
df.in %>% 
  fill(c(ids, dx), .direction = "down") %>% 
  drop_na() %>% 
  mutate_if(is.factor, as.character) # optional

# A tibble: 8 x 4
  ids   dx       alt     val
  <chr> <chr>    <chr> <dbl>
1 A101  Lung     A766   3.20
2 A101  Lung     G283   4.30
3 A101  Lung     F933   7.20
4 B101  Kidney   B293   2.10
5 B101  Kidney   T432   3.80
6 C101  Prostate U920   8.10
7 C101  Prostate D289   5.30
8 C101  Prostate S203   7.10

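The last line in the chain, mutate_if(is.factor, as.character), is optional and converts factors to characters. We can avoid this step by using stringsAsFactors = FALSE when creating the dataset.

Answer 1: (score: 0)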
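
As a minimal sketch of that last point, the question's data can be built with stringsAsFactors = FALSE so the columns start out as character and the mutate_if() step is not needed. The data values are copied from the question. Passing ids and dx to fill() as separate arguments, and storing the result in df.out, are choices made here for illustration rather than part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, but kept as character columns
df.in <- data.frame(
  ids = c("A101", "", "", "", "B101", "", "", "C101", "", "", ""),
  dx  = c("Lung", "", "", "", "Kidney", "", "", "Prostate", "", "", ""),
  alt = c("", "A766", "G283", "F933", "", "B293", "T432", "", "U920", "D289", "S203"),
  val = c(NA, 3.2, 4.3, 7.2, NA, 2.1, 3.8, NA, 8.1, 5.3, 7.1),
  stringsAsFactors = FALSE
)

# Replace blank cells "" with NA so fill() and drop_na() can see them
df.in[df.in == ""] <- NA

df.out <- df.in %>%
  fill(ids, dx, .direction = "down") %>%  # carry each sample's id and dx downward
  drop_na()                               # drop the header rows (alt and val are NA)
```

Note that drop_na() with no arguments removes every row containing any NA, so this assumes the real data has no measurement rows with a genuinely missing alt or val.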

> indx=rep(which(is.na(df.in$val)),rle(cumsum(is.na(df.in$val)))$length)
> na.omit(cbind(df.in[indx,-4],val=df.in$val))
     ids       dx alt val
1.1 A101     Lung     3.2
1.2 A101     Lung     4.3
1.3 A101     Lung     7.2
5.1 B101   Kidney     2.1
5.2 B101   Kidney     3.8
8.1 C101 Prostate     8.1
8.2 C101 Prostate     5.3
8.3 C101 Prostate     7.1

Breakdown:

> first<-which(is.na(df.in$val))# The positions for every new group ie 1,5 and 8
> groups=cumsum(is.na(df.in$val))#The groups you have
> groupsize=rle(groups)$length#The size of the groups
> newdf=transform(df.in[rep(first,groupsize),],val=df.in$val)#Create the new df
> newdf=na.omit(newdf)#Remove the NA rows
> row.names(newdf)=NULL# REMOVE THE ROWNAMES GIVEN
> newdf
   ids       dx alt val
1 A101     Lung     3.2
2 A101     Lung     4.3
3 A101     Lung     7.2
4 B101   Kidney     2.1
5 B101   Kidney     3.8
6 C101 Prostate     8.1
7 C101 Prostate     5.3
8 C101 Prostate     7.1
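
In the output above the alt column comes out empty, because whole header rows (where alt is "") are repeated and only val is replaced. A possible base R variant of the same cumsum() idea that keeps alt is sketched below. It assumes that the first row is a header row and that header rows are exactly the rows where val is NA. The names is_header, group, filled and result are introduced here for illustration.

```r
# Same data as in the question, kept as character columns
df.in <- data.frame(
  ids = c("A101", "", "", "", "B101", "", "", "C101", "", "", ""),
  dx  = c("Lung", "", "", "", "Kidney", "", "", "Prostate", "", "", ""),
  alt = c("", "A766", "G283", "F933", "", "B293", "T432", "", "U920", "D289", "S203"),
  val = c(NA, 3.2, 4.3, 7.2, NA, 2.1, 3.8, NA, 8.1, 5.3, 7.1),
  stringsAsFactors = FALSE
)

is_header <- is.na(df.in$val)   # assumption: a header row is one with val == NA
first     <- which(is_header)   # header positions: 1 5 8
group     <- cumsum(is_header)  # group number per row: 1 1 1 1 2 2 2 3 3 3 3

# Copy ids and dx from each group's header row; alt and val stay untouched
filled     <- df.in
filled$ids <- df.in$ids[first[group]]
filled$dx  <- df.in$dx[first[group]]

# Drop the header rows and reset the row names
result <- filled[!is_header, ]
row.names(result) <- NULL
result
```

If a measurement row could itself have a missing val, it would be mistaken for a header here, so a different header test (for example ids != "") would be needed in that case.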