I have a data frame in the following format:
i j score
chr12-100000000 chr12.100000000 0.333000
chr12-100000000 chr12.100050000 0.169200
chr12-100000000 chr12.100100000 0.054980
I would like to split the columns into:
chr_firstside position_firstside chr_secondside position_secondside score
chr12 100000000 chr12 100000000 0.333000
chr12 100000000 chr12 100050000 0.169200
chr12 100000000 chr12 100100000 0.054980
I want the result to be tab-delimited, and I want to do this in R. I tried the following, but it does not work:
library(data.table)
setDT(converted)[ , tstrsplit(i '[-]', type.convert=TRUE)]
Answer 0 (score: 2)
With tidyr:
library(tidyr)
df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
score = c(0.333, 0.1692, 0.05498),
stringsAsFactors = FALSE)
df %>%
separate(i, c('chr_i', 'position_i'), convert = TRUE) %>%
separate(j, c('chr_j', 'position_j'), convert = TRUE)
#> chr_i position_i chr_j position_j score
#> 1 chr12 100000000 chr12 100000000 0.33300
#> 2 chr12 100000000 chr12 100050000 0.16920
#> 3 chr12 100000000 chr12 100100000 0.05498
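By default, separate() splits on any run of non-alphanumeric characters, which is why the call above handles both "-" and "." without being told which one to use. Here is a minimal sketch that makes the separators explicit and uses the column names from the question, assuming the same df as above. The sep argument is treated as a regular expression, so the dot is escaped.

```r
library(tidyr)

df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# Split i on "-" and j on a literal "."; convert = TRUE turns the positions into integers
out <- df %>%
  separate(i, c("chr_firstside", "position_firstside"), sep = "-", convert = TRUE) %>%
  separate(j, c("chr_secondside", "position_secondside"), sep = "\\.", convert = TRUE)
out
```

Spelling out the separator is mostly a readability choice here, but it may help if the chromosome names could themselves contain punctuation.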
A long form may be more useful, though:
df_long <- df %>%
gather(var, val, i:j) %>%
separate(val, c('chr', 'position'), convert = TRUE)
df_long
#> score var chr position
#> 1 0.33300 i chr12 100000000
#> 2 0.16920 i chr12 100000000
#> 3 0.05498 i chr12 100000000
#> 4 0.33300 j chr12 100000000
#> 5 0.16920 j chr12 100050000
#> 6 0.05498 j chr12 100100000
...and if you want to get back to the wide form, that is possible too:
df_wide <- df_long %>%
gather(var2, val, chr:position) %>%
unite(var, var2, var) %>%
spread(var, val, convert = TRUE)
df_wide
#> # A tibble: 3 x 5
#> score chr_i chr_j position_i position_j
#> <dbl> <chr> <chr> <int> <int>
#> 1 0.0550 chr12 chr12 100000000 100100000
#> 2 0.169 chr12 chr12 100000000 100050000
#> 3 0.333 chr12 chr12 100000000 100000000
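Answer 1 (score: 2)
A base R option is to Map read.table over the first two columns, specifying the corresponding sep for each so that they are split into multiple columns, cbind the resulting list, set the desired column names ('nm1'), and then cbind the result with the score column: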
nm1 <- paste0(c('chr_', 'position_'), rep(c('firstside', 'secondside'), each = 2))
cbind(setNames(do.call(cbind, Map(read.table, text=df[1:2],
sep = list("-", "."))), nm1), df['score'])
# chr_firstside position_firstside chr_secondside position_secondside score
#1 chr12 100000000 chr12 100000000 0.33300
#2 chr12 100000000 chr12 100050000 0.16920
#3 chr12 100000000 chr12 100100000 0.05498
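To see what each Map step contributes, here is a small sketch of read.table applied to the i column alone, with the sample values typed in by hand. The default column names V1 and V2 are what setNames later replaces with nm1.

```r
# Sample values copied from the i column above
i_col <- c("chr12-100000000", "chr12-100000000", "chr12-100000000")

# read.table treats each string as one line of text and splits it on "-"
one <- read.table(text = i_col, sep = "-", stringsAsFactors = FALSE)
names(one)  # default names assigned by read.table
str(one)    # the position column is parsed as a number, not kept as text
```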
Answer 2 (score: 1)
Using sub:
df$chr_firstside <- sub("^([^-]+).*", "\\1", df$i)
df$position_firstside <- sub(".*?([^-]+)$", "\\1", df$i)
df$chr_secondside <- sub("^([^.]+).*", "\\1", df$j)
df$position_secondside <- sub(".*?([^.]+)$", "\\1", df$j)
If you no longer need them, you can also remove the i and j columns from the data frame:
df <- df[ , -which(names(df) %in% c("i","j"))]
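Note that sub returns character vectors, so the position columns are strings at this point. If numeric positions are needed, the following is a minimal sketch of the conversion, assuming every position fits in R's integer range (as the sample values do). It uses a simpler prefix-stripping pattern than the one above, purely for illustration.

```r
df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# Strip the chromosome prefix and the separator, leaving the position as a string
df$position_firstside  <- sub("^[^-]+-", "", df$i)
df$position_secondside <- sub("^[^.]+\\.", "", df$j)

# Convert the extracted strings to integers
pos_cols <- c("position_firstside", "position_secondside")
df[pos_cols] <- lapply(df[pos_cols], as.integer)
str(df[pos_cols])
```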
Answer 3 (score: 1)
Using base R strsplit:
split_temp <- sapply(lapply(converted[, 1:2], strsplit, "[\\.-]"), unlist)
row_pos <- 1:nrow(split_temp) %% 2 == 0L
converted2 <- data.frame(chr_firstside = split_temp[!row_pos, "i"],
position_firstside = split_temp[row_pos, "i"],
chr_secondside = split_temp[!row_pos, "j"],
position_secondside = split_temp[row_pos, "j"],
score = converted$score)
print(converted2)
chr_firstside position_firstside chr_secondside position_secondside score
1 chr12 100000000 chr12 100000000 0.33300
2 chr12 100000000 chr12 100050000 0.16920
3 chr12 100000000 chr12 100100000 0.05498
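The question also asks for tab-delimited output, which the snippet above does not write out. Here is a minimal sketch using write.table, assuming a result shaped like converted2. The data frame is rebuilt by hand so the sketch runs on its own, and the output path is a temporary file chosen only for illustration.

```r
# Stand-in for converted2 above, typed in by hand
converted2 <- data.frame(chr_firstside = "chr12",
                         position_firstside = c(100000000L, 100000000L, 100000000L),
                         chr_secondside = "chr12",
                         position_secondside = c(100000000L, 100050000L, 100100000L),
                         score = c(0.333, 0.1692, 0.05498),
                         stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".tsv")  # hypothetical destination

# sep = "\t" gives tab-delimited columns; quote and row.names are switched off
# so the file holds only the header and the values
write.table(converted2, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
readLines(out_file)
```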
Answer 4 (score: 0)
I would suggest using cSplit from my "splitstackshape" package, which lets you supply a vector of split characters, one for each column to be split.
Demo (using the sample data from @alistaire's answer):
library(splitstackshape)
cSplit(df, c("i", "j"), c("-", "."))
# score i_1 i_2 j_1 j_2
# 1: 0.33300 chr12 100000000 chr12 100000000
# 2: 0.16920 chr12 100000000 chr12 100050000
# 3: 0.05498 chr12 100000000 chr12 100100000
To change the column order, use setcolorder.
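Here is a minimal sketch of that reordering, assuming a data.table with the column names shown in the cSplit output above (rebuilt by hand here so that only data.table is needed). The setnames call is an addition not mentioned in the answer; it renames the columns to match the question.

```r
library(data.table)

# Stand-in for the cSplit result above, typed in by hand
res <- data.table(score = c(0.333, 0.1692, 0.05498),
                  i_1 = "chr12", i_2 = c(100000000L, 100000000L, 100000000L),
                  j_1 = "chr12", j_2 = c(100000000L, 100050000L, 100100000L))

# Move score to the end; setcolorder modifies res by reference
setcolorder(res, c("i_1", "i_2", "j_1", "j_2", "score"))

# Optional: rename to the names requested in the question
setnames(res, c("i_1", "i_2", "j_1", "j_2"),
         c("chr_firstside", "position_firstside",
           "chr_secondside", "position_secondside"))
res
```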