Concatenate pairs of columns across an entire data frame

Date: 2018-11-07 18:06:15

Tags: r data.table tidyverse

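I am working with genetic data and need to concatenate pairs of columns. The data I have stores the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, etc.). I need a way to combine the columns pairwise across the entire data frame. I have provided an example below, but my data has 1.7 million pairs (so I currently have 3.4 million columns), so anything that requires me to name each column will not work. I will change the column names later. If there is a way to do this in R, any guidance would be appreciated. I have tried creating a sequence and pasting the columns, e.g.: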

df <- data.frame(id = seq(1,20),
                 var1 = rep("A", 20),
                 var2 = c(rep("T", 10), rep("A", 10)),
                 var3 = rep("C", 20),
                 var4 = c(rep("C", 10), rep("G", 10)),
                 var5 = rep("A", 20),
                 var6 = c(rep("A", 10), rep("G", 10)),
                 stringsAsFactors = FALSE)

i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])

But it didn't work. I would like to go from:

    id var1 var2 var3 var4 var5 var6
1   1    A    T    C    C    A    A
2   2    A    T    C    C    A    A
3   3    A    T    C    C    A    A
4   4    A    T    C    C    A    A
5   5    A    T    C    C    A    A
6   6    A    T    C    C    A    A
7   7    A    T    C    C    A    A
8   8    A    T    C    C    A    A
9   9    A    T    C    C    A    A
10 10    A    T    C    C    A    A
11 11    A    A    C    G    A    G
12 12    A    A    C    G    A    G
13 13    A    A    C    G    A    G
14 14    A    A    C    G    A    G
15 15    A    A    C    G    A    G
16 16    A    A    C    G    A    G
17 17    A    A    C    G    A    G
18 18    A    A    C    G    A    G
19 19    A    A    C    G    A    G
20 20    A    A    C    G    A    G

To:

   id var1 var2 var3
1   1   AT   CC   AA
2   2   AT   CC   AA
3   3   AT   CC   AA
4   4   AT   CC   AA
5   5   AT   CC   AA
6   6   AT   CC   AA
7   7   AT   CC   AA
8   8   AT   CC   AA
9   9   AT   CC   AA
10 10   AT   CC   AA
11 11   AA   CG   AG
12 12   AA   CG   AG
13 13   AA   CG   AG
14 14   AA   CG   AG
15 15   AA   CG   AG
16 16   AA   CG   AG
17 17   AA   CG   AG
18 18   AA   CG   AG
19 19   AA   CG   AG
20 20   AA   CG   AG

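Edit: Thank you!!! I was able to adapt two of the answers to my data, and @akrun's ran faster. I created a subset of my data with 100 rows and 100,000 columns, and here are the results: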

microbenchmark(
+   {
+   new <- ped %>%
+   gather(key = V, value = value, -id) %>%
+   mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+   group_by(id) %>%
+   mutate(pair = ceiling(V / 2)) %>% 
+   group_by(id, pair) %>%
+   summarise(combined = paste(value, collapse = "")) %>%
+   mutate(V_combo = paste0("V", pair)) %>%
+   select(-pair) %>%
+   spread(key = V_combo, value = combined) %>%
+   select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+   },
+   {
+   out <- ped[1]
+   new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+   
+   out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2), 
+                           function(i) do.call(paste0, ped[i:(i+1)]))
+   },
+   times = 1
+   )

Unit: seconds

   expr           min        lq      mean    median        uq       max neval
camille     250.30901 250.30901 250.30901 250.30901 250.30901 250.30901     1
akrun       23.52434  23.52434  23.52434  23.52434  23.52434  23.52434     1
    > 
    > new <- data.frame(new, stringsAsFactors = FALSE)
    > identical(new, out)
    [1] TRUE

5 Answers:

Answer 0: (score: 2)

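We can loop over the column positions, subset each column together with its adjacent column, paste the two together with do.call, and assign the results as new columns in a new dataset: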

out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2), 
               function(i) do.call(paste0, df[i:(i+1)]))
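The number of pairs is hard-coded as 1:3 above. For a wider table it can be derived from the column count instead. The following is a minimal sketch of that idea, not part of the original answer: the function name combine_pairs and its prefix argument are hypothetical, and it assumes the first column is the id and that an even number of data columns follows it.

```r
# Hypothetical helper: pastes columns (2,3), (4,5), ... of `df` together,
# keeping column 1 (assumed to be the id) unchanged.
combine_pairs <- function(df, prefix = "var") {
  n_data <- ncol(df) - 1L
  # Assumption: data columns come in complete pairs.
  stopifnot(n_data > 0L, n_data %% 2L == 0L)

  starts <- seq(2L, ncol(df), by = 2L)  # first column of each pair
  out <- df[1]
  out[paste0(prefix, seq_along(starts))] <- lapply(
    starts,
    # unname() so column names are never mistaken for paste0() arguments
    function(i) do.call(paste0, unname(as.list(df[i:(i + 1L)])))
  )
  out
}

# Small example with two pairs
demo <- data.frame(id = 1:2,
                   var1 = c("A", "A"), var2 = c("T", "A"),
                   var3 = c("C", "C"), var4 = c("C", "G"),
                   stringsAsFactors = FALSE)
res <- combine_pairs(demo)
print(res)
```

The early stopifnot is there so that an odd number of data columns fails loudly instead of silently pairing a column with one that does not exist.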

Answer 1: (score: 2)

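Using tidyverse, you can compose the mutate expressions ahead of time and then pass them all to transmute in one go. This solution uses column names, so it is robust to column ordering: if you shuffle the allele columns, you still get the same answer.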

library( tidyverse )

# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
    map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )

df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
#       id allele1 allele2 allele3
#    <int> <chr>   <chr>   <chr>  
#  1     1 AT      CC      AA     
#  2     2 AT      CC      AA     
#  3     3 AT      CC      AA     
#  4     4 AT      CC      AA     
# ...

I modified your data to better match your description:

df <- tibble(id = seq(1,20),
             allele1a = rep("A", 20),
             allele1b = c(rep("T", 10), rep("A", 10)),
             allele2a = rep("C", 20),
             allele2b = c(rep("C", 10), rep("G", 10)),
             allele3a = rep("A", 20),
             allele3b = c(rep("A", 10), rep("G", 10)))
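The same name-based pairing can also be sketched in base R, without building expressions. This is a swapped-in technique rather than the answer's own method, and it assumes the columns follow the allele<N>a / allele<N>b naming pattern from the question; the variable names allele_cols and stems are illustrative only.

```r
# Columns deliberately shuffled to show that pairing is done by name
df <- data.frame(id = 1:3,
                 allele2b = c("C", "C", "G"),
                 allele1a = c("A", "A", "A"),
                 allele1b = c("T", "T", "A"),
                 allele2a = c("C", "C", "C"),
                 stringsAsFactors = FALSE)

# Assumed naming pattern: "allele", a number, then "a" or "b"
allele_cols <- grep("^allele[0-9]+[ab]$", names(df), value = TRUE)
stems <- unique(sub("[ab]$", "", allele_cols))  # e.g. "allele2", "allele1"

out <- df["id"]
out[stems] <- lapply(stems, function(s) {
  # Look up the "a" and "b" columns by name, regardless of position
  paste0(df[[paste0(s, "a")]], df[[paste0(s, "b")]])
})
print(out)
```

The output columns appear in the order the stems are first encountered, so reorder them afterwards if a particular order matters.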

Answer 2: (score: 2)

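Here is a tidyverse approach that is meant to scale well. Rather than hard-coding columns 1 & 2, 3 & 4, and 5 & 6, you reshape to long data, extract the variable number, divide it by 2 and round up to get a pair number, group by pair, collapse the letters within each pair, and then reshape back to wide. This way the same process works for any even number of columns.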

library(tidyverse)
...

Filtering for ID 1 to show what this looks like:

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>%
  filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups:   id [1]
#>      id   var value  pair
#>   <int> <dbl> <chr> <dbl>
#> 1     1     1 A         1
#> 2     1     2 T         1
#> 3     1     3 C         2
#> 4     1     4 C         2
#> 5     1     5 A         3
#> 6     1     6 A         3
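The pair column in this output comes from ceiling(var / 2), which maps variable numbers 1 and 2 to pair 1, 3 and 4 to pair 2, and so on. A minimal base R sketch of just that arithmetic follows; it uses sub() as a stand-in for str_extract(), and the names var_names, var_num and pair are illustrative.

```r
var_names <- paste0("var", 1:6)

# Strip the non-digit prefix to get the variable number
var_num <- as.numeric(sub("\\D+", "", var_names))

# Consecutive numbers (1,2), (3,4), (5,6) share a pair index
pair <- ceiling(var_num / 2)

print(data.frame(var = var_names, pair = pair))
```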

Then collapse the strings into a summary value for each combination of ID and pair:

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>% 
  group_by(id, pair) %>%
  summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups:   id [?]
#>       id  pair combined
#>    <int> <dbl> <chr>   
#>  1     1     1 AT      
#>  2     1     2 CC      
#>  3     1     3 AA      
#>  4     2     1 AT      
#>  5     2     2 CC      
#>  6     2     3 AA      
#>  7     3     1 AT      
#>  8     3     2 CC      
#>  9     3     3 AA      
#> 10     4     1 AT      
#> # ... with 50 more rows

And use spread to get back to a wide format:

df %>%
  gather(key = var, value = value, -id) %>%
  mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
  group_by(id) %>%
  mutate(pair = ceiling(var / 2)) %>% 
  group_by(id, pair) %>%
  summarise(combined = paste(value, collapse = "")) %>%
  mutate(var_combo = paste0("var", pair)) %>%
  select(-pair) %>%
  spread(key = var_combo, value = combined) %>%
  head()
#> # A tibble: 6 x 4
#> # Groups:   id [6]
#>      id var1  var2  var3 
#>   <int> <chr> <chr> <chr>
#> 1     1 AT    CC    AA   
#> 2     2 AT    CC    AA   
#> 3     3 AT    CC    AA   
#> 4     4 AT    CC    AA   
#> 5     5 AT    CC    AA   
#> 6     6 AT    CC    AA

Created on 2018-11-07 by the reprex package (v0.2.1)

Answer 3: (score: 2)

Using base R you can do:

 a <- seq(2,ncol(df),2)
 b <- paste0(unlist(df[a]),unlist(df[a+1]))
 d <- data.frame(matrix(b,nrow(df)))
 result <- cbind(df[1],d)

This can also be written in one line:

(dat =  data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
   X1 X2 X3
1  AT CC AA
2  AT CC AA
3  AT CC AA
4  AT CC AA
5  AT CC AA
6  AT CC AA
7  AT CC AA
8  AT CC AA
9  AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG

Then bind it to the id column:

cbind(df[1],dat)
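This approach relies on unlist() stacking the selected columns one after another and on matrix() filling column by column, so the pasted vector folds back into one column per pair. A small sketch on a reduced data frame (the name small and its values are made up for illustration) shows the intermediate steps:

```r
small <- data.frame(id = 1:2,
                    var1 = c("A", "A"), var2 = c("T", "A"),
                    var3 = c("C", "C"), var4 = c("C", "G"),
                    stringsAsFactors = FALSE)

a <- seq(2, ncol(small), 2)     # first column of each pair: 2, 4

left  <- unlist(small[a])       # var1 stacked on top of var3
right <- unlist(small[a + 1])   # var2 stacked on top of var4
b <- paste0(left, right)        # element-wise paste, still column by column

# matrix() fills column-wise, so each pair becomes one column again
m <- matrix(b, nrow(small))
print(m)
```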

Answer 4: (score: 0)

df <- data.frame(id = seq(1,20),
                 var1 = rep("A", 20),
                 var2 = c(rep("T", 10), rep("A", 10)),
                 var3 = rep("C", 20),
                 var4 = c(rep("C", 10), rep("G", 10)),
                 var5 = rep("A", 20),
                 var6 = c(rep("A", 10), rep("G", 10)),
                 stringsAsFactors = FALSE)

df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""), 
                  var2 = paste(df[,4], df[,5], sep = ""), 
                  var3 = paste(df[,6], df[,7], sep = ""))