Split columns into multiple columns and keep the remaining column

Time: 2018-01-16 17:01:47

Tags: r regex dataframe multiple-columns

I have a data frame in the following format:

i               j               score
chr12-100000000 chr12.100000000 0.333000
chr12-100000000 chr12.100050000 0.169200
chr12-100000000 chr12.100100000 0.054980

I would like to split the columns like this:

chr_firstside   position_firstside  chr_secondside  position_secondside score
chr12           100000000           chr12           100000000           0.333000
chr12           100000000           chr12           100050000           0.169200
chr12           100000000           chr12           100100000           0.054980

I want the result to be tab-separated, and I want to do this in R. I tried the following, but it doesn't work:

library(data.table)
setDT(converted)[ , tstrsplit(i '[-]', type.convert=TRUE)]

5 Answers:

Answer 0 (score: 2)

With tidyr,

library(tidyr)

df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"), 
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"), 
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

df %>% 
    separate(i, c('chr_i', 'position_i'), convert = TRUE) %>% 
    separate(j, c('chr_j', 'position_j'), convert = TRUE)
#>   chr_i position_i chr_j position_j   score
#> 1 chr12  100000000 chr12  100000000 0.33300
#> 2 chr12  100000000 chr12  100050000 0.16920
#> 3 chr12  100000000 chr12  100100000 0.05498

A long form may be more practical, though:

df_long <- df %>% 
    gather(var, val, i:j) %>% 
    separate(val, c('chr', 'position'), convert = TRUE) 

df_long
#>     score var   chr  position
#> 1 0.33300   i chr12 100000000
#> 2 0.16920   i chr12 100000000
#> 3 0.05498   i chr12 100000000
#> 4 0.33300   j chr12 100000000
#> 5 0.16920   j chr12 100050000
#> 6 0.05498   j chr12 100100000

...and if you want to get back to wide form, that's possible too:

df_wide <- df_long %>% 
    gather(var2, val, chr:position) %>% 
    unite(var, var2, var) %>%
    spread(var, val, convert = TRUE)

df_wide
#> # A tibble: 3 x 5
#>    score chr_i chr_j position_i position_j
#>    <dbl> <chr> <chr>      <int>      <int>
#> 1 0.0550 chr12 chr12  100000000  100100000
#> 2 0.169  chr12 chr12  100000000  100050000
#> 3 0.333  chr12 chr12  100000000  100000000

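Answer 1 (score: 2)

A base R option is to use read.table with Map on the first two columns, specifying the corresponding sep for each read.table call so that every column is split into multiple columns. Then cbind the resulting list, rename the columns with the desired column names ('nm1'), and cbind the result with the 'score' column.
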
nm1 <- paste0(c('chr_', 'position_'), rep(c('firstside', 'secondside'), each = 2))
cbind(setNames(do.call(cbind, Map(read.table, text=df[1:2],  
               sep = list("-", "."))), nm1), df['score'])
#  chr_firstside position_firstside chr_secondside position_secondside   score
#1         chr12          100000000          chr12           100000000 0.33300
#2         chr12          100000000          chr12           100050000 0.16920
#3         chr12          100000000          chr12           100100000 0.05498
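
For readers who find the nested call hard to follow, the same steps can be written out one at a time. This is only a sketch: it rebuilds the sample data from Answer 0 so that it runs on its own, and the intermediate names (`parts`, `result`, `out_file`) are made up for illustration. The last step writes the result as a tab-delimited file, which is what the question asks for.

```r
# Sample data, copied from Answer 0 so the sketch is self-contained
df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# Step 1: parse each column as a small delimited "file", using its own separator
parts <- Map(function(col, sep) read.table(text = col, sep = sep,
                                           stringsAsFactors = FALSE),
             df[c("i", "j")], c("-", "."))

# Step 2: bind the two parsed pieces side by side and apply the target names
nm1 <- paste0(c("chr_", "position_"), rep(c("firstside", "secondside"), each = 2))
result <- setNames(do.call(cbind, parts), nm1)

# Step 3: add the untouched score column back
result <- cbind(result, df["score"])

# Step 4: write a tab-delimited file (a temporary, hypothetical path is used here)
out_file <- tempfile(fileext = ".tsv")
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Replacing `out_file` with a real path gives a file that can be read back with `read.delim`.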

Answer 2 (score: 1)

Using sub:

df$chr_firstside <- sub("^([^-]+).*", "\\1", df$i)
df$position_firstside <- sub(".*?([^-]+)$", "\\1", df$i)
df$chr_secondside <- sub("^([^.]+).*", "\\1", df$j)
df$position_secondside <- sub(".*?([^.]+)$", "\\1", df$j)

You can also drop the i and j columns from the data frame if you no longer need them:

df <- df[ , -which(names(df) %in% c("i","j"))]
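
One thing to watch: `sub` always returns character strings, so the position columns produced this way are text rather than numbers. The sketch below shows one possible way to handle that on a standalone copy of the sample data. It strips the chromosome prefix with patterns anchored at the start of the string (a substitute for the lazy `.*?` patterns, not the answerer's own code) and then converts with `as.integer`. The object name `d` is hypothetical.

```r
# Standalone copy of the sample data (the name `d` is hypothetical)
d <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                score = c(0.333, 0.1692, 0.05498),
                stringsAsFactors = FALSE)

# Everything before the first separator is the chromosome
d$chr_firstside  <- sub("-.*$", "", d$i)
d$chr_secondside <- sub("[.].*$", "", d$j)

# Drop the chromosome and its separator, then convert the remainder to integer
d$position_firstside  <- as.integer(sub("^[^-]*-", "", d$i))
d$position_secondside <- as.integer(sub("^[^.]*[.]", "", d$j))

# Keep only the new columns plus score, in the order the question asks for
d <- d[, c("chr_firstside", "position_firstside",
           "chr_secondside", "position_secondside", "score")]
str(d)
```

`as.integer` only covers values up to 2147483647; for anything larger, `as.numeric` would be the safer conversion.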


Answer 3 (score: 1)

Using base R strsplit:

split_temp <- sapply(lapply(converted[, 1:2], strsplit, "[\\.-]"), unlist)
row_pos <- 1:nrow(split_temp) %% 2 == 0L
converted2 <- data.frame(chr_firstside       = split_temp[!row_pos, "i"],
                         position_firstside  = split_temp[row_pos, "i"],
                         chr_secondside      = split_temp[!row_pos, "j"],
                         position_secondside = split_temp[row_pos, "j"],
                         score               = converted$score)
print(converted2)
  chr_firstside position_firstside chr_secondside position_secondside   score
1         chr12          100000000          chr12           100000000 0.33300
2         chr12          100000000          chr12           100050000 0.16920
3         chr12          100000000          chr12           100100000 0.05498
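
The even/odd row indexing works because `unlist` interleaves the pieces: chromosome, position, chromosome, position, and so on. As a sketch of a variation (not the answerer's code), the pieces of each string can instead be stacked into a two-column matrix with `do.call(rbind, ...)`, which avoids the row mask. It assumes every value contains exactly one separator; the helper name `split_two` is invented for illustration.

```r
# Sample data rebuilt here so the sketch runs on its own
converted <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                        j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                        score = c(0.333, 0.1692, 0.05498),
                        stringsAsFactors = FALSE)

# Hypothetical helper: split each string on "-" or "." and stack the pieces
# into a matrix with one row per input string and one column per piece
split_two <- function(x) do.call(rbind, strsplit(x, "[.-]"))

first  <- split_two(converted$i)
second <- split_two(converted$j)

converted2 <- data.frame(chr_firstside       = first[, 1],
                         position_firstside  = as.integer(first[, 2]),
                         chr_secondside      = second[, 1],
                         position_secondside = as.integer(second[, 2]),
                         score               = converted$score,
                         stringsAsFactors    = FALSE)
print(converted2)
```

Because each input string becomes one matrix row, the chromosome and position stay paired by construction instead of by row parity.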

Answer 4 (score: 0)

I'd suggest using cSplit from my "splitstackshape" package, which lets you supply a vector of split characters, one for each column to be split.

Demo (using sample data from @alistaire's answer):

library(splitstackshape)
cSplit(df, c("i", "j"), c("-", "."))
#      score   i_1       i_2   j_1       j_2
# 1: 0.33300 chr12 100000000 chr12 100000000
# 2: 0.16920 chr12 100000000 chr12 100050000
# 3: 0.05498 chr12 100000000 chr12 100100000

Use setcolorder to change the column order.