Create a new column by combining multiple columns of a data frame, separated by "-"

Time: 2015-04-19 07:58:29

Tags: r dataframe

How can I create a new column in a data frame by combining the elements of several columns, separated by "-"?

df = read.table(text = "

   chr  start       end         CNA
    1  68580000    68640000    loss
    3  15360000    16000000    loss
    4  122660000   123500000   gain
    7  48320000    48400000    loss

", sep = "", header = TRUE)

The desired output is:

chr  start       end        CNA       sample.ID
1   68580000    68640000    loss    1-68580000-68640000
3   15360000    16000000    loss    3-15360000-16000000
4   122660000   123500000   gain    4-122660000-123500000
7   48320000    48400000    loss    7-48320000-48400000

I tried this:

cols <- c("ID","seqnames","start")
df$sample.id  <- do.call(paste,rbind(final[cols]))

I am not sure how to separate the values in the new column with "-".

3 answers:

Answer 0 (score: 2)

You could try

cols <- c("chr", "start", "end")
df$sample.id <- do.call(paste, c(df[cols], sep="-"))
df
#  chr     start       end  CNA             sample.id
#1   1  68580000  68640000 loss   1-68580000-68640000
#2   3  15360000  16000000 loss   3-15360000-16000000
#3   4 122660000 123500000 gain 4-122660000-123500000
#4   7  48320000  48400000 loss   7-48320000-48400000
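
To see why the do.call() form works: a data frame is a list of columns, so c(df[cols], sep = "-") builds the full argument list for paste(), and the call should be equivalent to passing the columns one by one. A minimal sketch, assuming the same column names as the question's data (the data frame is rebuilt here only so the snippet runs by itself):

```r
# Rebuild the example data with integer columns, as in the question
df <- data.frame(
  chr   = c(1L, 3L, 4L, 7L),
  start = c(68580000L, 15360000L, 122660000L, 48320000L),
  end   = c(68640000L, 16000000L, 123500000L, 48400000L),
  CNA   = c("loss", "loss", "gain", "loss"),
  stringsAsFactors = FALSE
)
cols <- c("chr", "start", "end")

# do.call() spreads the list of columns (plus sep) into paste()'s arguments
via_do_call <- do.call(paste, c(df[cols], sep = "-"))

# The same call written out by hand
explicit <- paste(df$chr, df$start, df$end, sep = "-")

identical(via_do_call, explicit)
```

The do.call() version is convenient when the column names are held in a vector such as cols, since the number of columns does not have to be known in advance.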

Or

do.call(sprintf, c(df[cols], fmt='%s-%s-%s'))

Benchmarks

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1000, 3*1e6, replace=TRUE), ncol=3))

akrun1 <- function() {do.call(paste, c(df1, sep="-"))}
akrun2 <- function() {do.call(sprintf, c(df1, fmt='%s-%s-%s'))}
Mamoun <- function() {apply(df1, 1, paste0, collapse="-")}
library(microbenchmark)
microbenchmark(akrun1(), akrun2(), Mamoun(), unit='relative', times=20L)
#Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
#akrun1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 
#akrun2() 1.003460 1.034647 1.053498 1.052192 1.048754 1.076809    20  a 
#Mamoun() 6.706523 7.460565 6.657434 6.626459 6.416989 6.105361    20   b

Data

df <- structure(list(chr = c(1L, 3L, 4L, 7L), start = c(68580000L, 
15360000L, 122660000L, 48320000L), end = c(68640000L, 16000000L, 
123500000L, 48400000L), CNA = c("loss", "loss", "gain", "loss"
)), .Names = c("chr", "start", "end", "CNA"), class = "data.frame", 
row.names = c(NA, -4L))

Answer 1 (score: 2)

You can also use apply.

Try this similar solution:
df$sample.ID <- apply(df[, -4], 1, paste0, collapse="-")
df
  chr     start       end  CNA             sample.ID
1   1  68580000  68640000 loss   1-68580000-68640000
2   3  15360000  16000000 loss   3-15360000-16000000
3   4 122660000 123500000 gain 4-122660000-123500000
4   7  48320000  48400000 loss   7-48320000-48400000
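
One caveat with apply(), sketched on a made-up two-row data frame (`demo` and its values are illustrative, not from the question): apply() first converts the data frame with as.matrix(), and when a character column is included, the numeric columns are passed through format(), which pads them to a common width. Dropping the non-numeric column first, as df[, -4] does above, should avoid this.

```r
# Hypothetical data with one character column next to integer columns
demo <- data.frame(
  chr   = c(1L, 3L),
  start = c(68580000L, 122660000L),
  CNA   = c("loss", "gain"),
  stringsAsFactors = FALSE
)

# Character column included: as.matrix() formats the numbers, so the
# shorter start value picks up a leading space
padded <- apply(demo, 1, paste0, collapse = "-")

# Integer columns only: the matrix stays integer and nothing is padded
clean <- apply(demo[, c("chr", "start")], 1, paste0, collapse = "-")

print(padded)
print(clean)
```

If a mixed-type selection cannot be avoided, the paste()-based approach from the other answer works column by column and does not go through as.matrix().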

Answer 2 (score: 1)

You can also try the unite() function from library(tidyr). http://rpubs.com/bradleyboehmke/data_wrangling has some examples.

# remove = FALSE keeps chr, start and end; by default unite() drops them
df <- unite(df, sample.id, chr, start, end, sep = "-", remove = FALSE)