Split strings and stack them in a single column

Time: 2016-10-14 14:33:37

Tags: r

I have a data frame with this structure:

> df
modifications
13-MOD:0057
13-MOD:0046
13-MOD:0051,13-MOD:0076
13-MOD:0036,13-MOD:0076,13-MOD:0016
13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125
13-MOD:0014 13-MOD:0156, 13-MOD:0956,13-MOD:0125...n
13-MOD:0012 ... n

To split the data I use this code:

df2 <- data.frame(str_split_fixed(df$modifications, ",", 20))

Basically, this is the data I get:

> df2
x1          | x2           | x3          | x4          |
13-MOD:0057 | empty        | empty       | empty       |
13-MOD:0046 | empty        | empty       | empty       |
13-MOD:0051 | 13-MOD:0076  | empty       | empty       |
13-MOD:0036 | 13-MOD:0076  | 13-MOD:0016 | empty       |
13-MOD:0256 | 13-MOD:0156  | 13-MOD:0956 | 13-MOD:0125
13-MOD:0014 | 13-MOD:0156  | 13-MOD:0956 | 13-MOD:0125  | ... n
13-MOD:0012 | ...          | ...n

What I want is to remove the empty values and stack the data from columns X2, X3, X4 ... n under the first column, X1.

For that, I used this:

df3 <- melt(setDT(df2),                       # set df to a data.table
 measure.vars = list(c(1:20)),    # set column groupings
 value.name = 'V')[                      # set output name scheme
   , -1, with = F]

To remove the empty values:

df3[df3==""] <- NA

histo3 = subset(df3, V1 != 'NA')

But I don't know why I get an error about column lengths from the melt function. Do you know an easier way to do this?

Reproducible example:

df <- data.frame(modifications=c("UNIMOD:108,UNIMOD:108","UNIMOD:108","UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108"))

1 answer:

Answer 0: (score: 1)

Maybe something like this?

library(stringr)

# input dataset
s <- c('13-MOD:0057', '13-MOD:0046', '13-MOD:0051,13-MOD:0076', '13-MOD:0036,13-MOD:0076,13-MOD:0016', '13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125')

s
[1] "13-MOD:0057"                                    
[2] "13-MOD:0046"                                    
[3] "13-MOD:0051,13-MOD:0076"                        
[4] "13-MOD:0036,13-MOD:0076,13-MOD:0016"            
[5] "13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125"

# get the individual lengths
lengths <- sapply(str_split(s,','), function(x){ length(x) })

# create the dataframe splitting in N columns
as.data.frame(str_split_fixed(s, ',', max(lengths)))

  V1          V2          V3          V4
1 13-MOD:0057                                    
2 13-MOD:0046                                    
3 13-MOD:0051 13-MOD:0076                        
4 13-MOD:0036 13-MOD:0076 13-MOD:0016            
5 13-MOD:0256 13-MOD:0156 13-MOD:0956 13-MOD:0125

Update 1: stack all non-empty cells into a single column

 # create the dataframe splitting in N columns
 first.matrix <- str_split_fixed(s, ',', max(lengths))

 # select only the cells != ""  
 first.matrix[which(first.matrix!="")]

[1] "13-MOD:0057" "13-MOD:0046" "13-MOD:0051" "13-MOD:0036" "13-MOD:0256" "13-MOD:0076"
[7] "13-MOD:0076" "13-MOD:0156" "13-MOD:0016" "13-MOD:0956" "13-MOD:0125"
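
Note that indexing the matrix returns the values in column-major order (all of V1, then V2, and so on). If the wide intermediate table is not needed at all, a minimal sketch in base R could skip it and flatten the split pieces directly, which also keeps the original row order. This is only one possible approach, not necessarily what the asker needs; the names pieces, stacked and df3 and the output column name V1 are chosen here for illustration, and the trimws() call is an assumption that stray spaces after commas should be discarded.

```r
# Sample input, copied from the answer above
s <- c("13-MOD:0057", "13-MOD:0046", "13-MOD:0051,13-MOD:0076",
       "13-MOD:0036,13-MOD:0076,13-MOD:0016",
       "13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125")
df <- data.frame(modifications = s, stringsAsFactors = FALSE)

# Split every row on commas; strsplit() returns one character vector per row.
# as.character() guards against the column having been read in as a factor.
pieces <- strsplit(as.character(df$modifications), ",", fixed = TRUE)

# Flatten the list row by row, trim stray whitespace, drop empty strings
stacked <- trimws(unlist(pieces))
stacked <- stacked[nzchar(stacked)]

# Wrap the result in a one-column data frame (the name V1 is arbitrary)
df3 <- data.frame(V1 = stacked, stringsAsFactors = FALSE)
df3
```

Because no fixed number of columns is ever created, this avoids both the hard-coded 20 in str_split_fixed and the melt step.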