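I'm working with genetic data, so I need to concatenate pairs of columns. My data has the major and minor alleles in separate columns (e.g., allele1a, allele1b, allele2a, allele2b, and so on). I need a way to paste the columns together pair by pair across the whole data frame. I've provided an example below, but my real data has 1.7 million pairs (so I currently have 3.4 million columns), so anything that requires me to name each column won't work. I will change the column names later. If there is a way to do this in R, I'd appreciate any guidance. I have tried creating a sequence and pasting the columns, for example: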
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
i <- seq.int(1, length(ped), by = 2L)
df <- paste0(df[i], df[i+1])
But that did not work. I would like to go from:
id var1 var2 var3 var4 var5 var6
1 1 A T C C A A
2 2 A T C C A A
3 3 A T C C A A
4 4 A T C C A A
5 5 A T C C A A
6 6 A T C C A A
7 7 A T C C A A
8 8 A T C C A A
9 9 A T C C A A
10 10 A T C C A A
11 11 A A C G A G
12 12 A A C G A G
13 13 A A C G A G
14 14 A A C G A G
15 15 A A C G A G
16 16 A A C G A G
17 17 A A C G A G
18 18 A A C G A G
19 19 A A C G A G
20 20 A A C G A G
To:
id var1 var2 var3
1 1 AT CC AA
2 2 AT CC AA
3 3 AT CC AA
4 4 AT CC AA
5 5 AT CC AA
6 6 AT CC AA
7 7 AT CC AA
8 8 AT CC AA
9 9 AT CC AA
10 10 AT CC AA
11 11 AA CG AG
12 12 AA CG AG
13 13 AA CG AG
14 14 AA CG AG
15 15 AA CG AG
16 16 AA CG AG
17 17 AA CG AG
18 18 AA CG AG
19 19 AA CG AG
20 20 AA CG AG
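Edit: Thank you! I was able to adapt two of the answers to my data, and @akrun's ran faster. I created a subset of my data with 100 rows and 100,000 columns, and these were the results: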
microbenchmark(
+ {
+ new <- ped %>%
+ gather(key = V, value = value, -id) %>%
+ mutate(V = str_extract(V, "\\d+") %>% as.numeric()) %>%
+ group_by(id) %>%
+ mutate(pair = ceiling(V / 2)) %>%
+ group_by(id, pair) %>%
+ summarise(combined = paste(value, collapse = "")) %>%
+ mutate(V_combo = paste0("V", pair)) %>%
+ select(-pair) %>%
+ spread(key = V_combo, value = combined) %>%
+ select(id, paste0("V", seq(1, ncol(.)-1, 1)))
+ },
+ {
+ out <- ped[1]
+ new_cols <- paste0("V", seq(1, (ncol(ped)-1)/2))
+
+ out[new_cols] <- lapply(seq(2, ncol(ped)-1, 2),
+ function(i) do.call(paste0, ped[i:(i+1)]))
+ },
+ times = 1
+ )
Unit: seconds
expr min lq mean median uq max neval
camille 250.30901 250.30901 250.30901 250.30901 250.30901 250.30901 1
akrun 23.52434 23.52434 23.52434 23.52434 23.52434 23.52434 1
>
> new <- data.frame(new, stringsAsFactors = FALSE)
> identical(new, out)
[1] TRUE
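To reproduce a comparison like this without the real data, a synthetic stand-in can be generated. The sketch below assumes ped has an id column followed by columns V1, V2, ... holding single-letter alleles; that layout is inferred from the benchmark code rather than shown in the question. The names ped_toy and out_toy are made up for illustration, the sizes are deliberately small, and the timing will not match the figures above.

```r
set.seed(1)
n_row <- 100
n_col <- 1000  # must be even; far smaller than the 100,000 columns used above

# Assumed layout: id, then V1..Vn of single-letter alleles
alleles <- matrix(sample(c("A", "C", "G", "T"), n_row * n_col, replace = TRUE),
                  nrow = n_row)
ped_toy <- data.frame(id = seq_len(n_row), alleles, stringsAsFactors = FALSE)
names(ped_toy)[-1] <- paste0("V", seq_len(n_col))

# Time the lapply/paste0 approach from the benchmark on the toy data
timing <- system.time({
  out_toy <- ped_toy[1]
  new_cols <- paste0("V", seq(1, (ncol(ped_toy) - 1) / 2))
  out_toy[new_cols] <- lapply(seq(2, ncol(ped_toy) - 1, 2),
                              function(i) do.call(paste0, ped_toy[i:(i + 1)]))
})
print(timing)
```

Increasing n_col gives a rough idea of how the approach scales before running it on the full data.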
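Answer 0 (score: 2)
We can loop over the column indices, subset each column together with its adjacent column, paste the two together with do.call, and assign the results as new columns in a new dataset.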
out <- df[1]
out[paste0("var", 1:3)] <- lapply(seq(2, ncol(df), 2),
function(i) do.call(paste0, df[i:(i+1)]))
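The same idea can be wrapped in a small helper so that the new column names are derived from the number of pairs instead of being written out as 1:3. This is only a sketch: the function name paste_pairs and its arguments are hypothetical, and it assumes the id columns come first and the remaining columns are already ordered as adjacent pairs.

```r
# Hypothetical helper (not part of the original answer): paste adjacent
# column pairs of `df`, keeping the first `n_id` columns unchanged.
paste_pairs <- function(df, n_id = 1, prefix = "var") {
  n_data <- ncol(df) - n_id
  # Assumption: the non-id columns come in adjacent pairs
  stopifnot(n_data > 0, n_data %% 2 == 0)

  first <- seq(n_id + 1, ncol(df) - 1, by = 2)  # first column of each pair
  out <- df[seq_len(n_id)]
  out[paste0(prefix, seq_along(first))] <- lapply(
    first,
    function(i) paste0(df[[i]], df[[i + 1]])
  )
  out
}

# Small example with two pairs
toy <- data.frame(id = 1:2,
                  var1 = c("A", "A"), var2 = c("T", "A"),
                  var3 = c("C", "C"), var4 = c("C", "G"),
                  stringsAsFactors = FALSE)
res <- paste_pairs(toy)
print(res)
```

The stopifnot() check fails early if a column is missing from a pair, which is easier to diagnose than a subscript error partway through the loop.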
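Answer 1 (score: 2)
With the tidyverse, you can compose the mutate expressions ahead of time and then pass them all to transmute in one go. This solution works with column names, so it is robust to column ordering: if you shuffle your allele columns, you still get the same answer.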
library( tidyverse )
# Create expressions of the form allele1 = str_c(allele1a, allele1b)
v <- str_c("allele",1:3) %>% set_names %>%
map( ~glue::glue("str_c({.}a, {.}b)") ) %>% map( rlang::parse_expr )
df %>% transmute( id = id, !!!v )
# # A tibble: 20 x 4
# id allele1 allele2 allele3
# <int> <chr> <chr> <chr>
# 1 1 AT CC AA
# 2 2 AT CC AA
# 3 3 AT CC AA
# 4 4 AT CC AA
# ...
I modified your data to match your description more closely:
df <- tibble(id = seq(1,20),
allele1a = rep("A", 20),
allele1b = c(rep("T", 10), rep("A", 10)),
allele2a = rep("C", 20),
allele2b = c(rep("C", 10), rep("G", 10)),
allele3a = rep("A", 20),
allele3b = c(rep("A", 10), rep("G", 10)))
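The name-based pairing can also be sketched in base R without building expressions. This assumes the column names follow the allele<N>a / allele<N>b pattern used above; the object names df_shuffled, stems, and combined are illustrative only.

```r
# Assumption: columns are named allele<N>a / allele<N>b, in any order
df_shuffled <- data.frame(id = 1:3,
                          allele2b = c("C", "C", "G"),
                          allele1a = c("A", "A", "A"),
                          allele2a = c("C", "C", "C"),
                          allele1b = c("T", "T", "A"),
                          stringsAsFactors = FALSE)

# Find the "a" columns, then derive the stem shared with the "b" column
a_cols <- grep("^allele[0-9]+a$", names(df_shuffled), value = TRUE)
stems  <- sub("a$", "", a_cols)
# Sort stems by their numeric part so allele10 comes after allele2
stems  <- stems[order(as.integer(sub("^allele", "", stems)))]

combined <- df_shuffled["id"]
combined[stems] <- lapply(stems, function(s) {
  paste0(df_shuffled[[paste0(s, "a")]], df_shuffled[[paste0(s, "b")]])
})
print(combined)
```

Because each pair is looked up by name, shuffling the input columns does not change the result. Note that a missing "b" column would not raise an error here, so a check on the names may be worth adding for real data.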
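Answer 2 (score: 2)
Here is a tidyverse approach that is meant to scale well. Instead of hard-coding columns 1 and 2, 3 and 4, and 5 and 6, reshape to long data, extract the variable number, assign each variable to a pair by dividing that number by 2 and rounding up, collapse the letters within each pair, and reshape back to wide. That way the same process works for any even number of columns.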
library(tidyverse)
...
Filtering for ID 1 to show what this looks like:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
filter(id == 1)
#> # A tibble: 6 x 4
#> # Groups: id [1]
#> id var value pair
#> <int> <dbl> <chr> <dbl>
#> 1 1 1 A 1
#> 2 1 2 T 1
#> 3 1 3 C 2
#> 4 1 4 C 2
#> 5 1 5 A 3
#> 6 1 6 A 3
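The pair column in the output above comes from simple arithmetic on the variable number. A minimal base R sketch of just that step, assuming names of the form var1 ... var6 (the object names here are illustrative):

```r
# Variable numbers extracted from names such as var1 ... var6
var_names <- paste0("var", 1:6)
var_num <- as.numeric(sub("^var", "", var_names))

# ceiling(n / 2) maps 1,2 -> 1; 3,4 -> 2; 5,6 -> 3
pair <- ceiling(var_num / 2)
print(data.frame(var = var_names, pair = pair))
```

This relies on the numbering starting at 1 and having no gaps; with other naming schemes the pair index would need to be derived differently.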
Then collapse the strings as the summary value for each combination of ID and pair:
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = ""))
#> # A tibble: 60 x 3
#> # Groups: id [?]
#> id pair combined
#> <int> <dbl> <chr>
#> 1 1 1 AT
#> 2 1 2 CC
#> 3 1 3 AA
#> 4 2 1 AT
#> 5 2 2 CC
#> 6 2 3 AA
#> 7 3 1 AT
#> 8 3 2 CC
#> 9 3 3 AA
#> 10 4 1 AT
#> # ... with 50 more rows
And use spread to get back to a wide format.
df %>%
gather(key = var, value = value, -id) %>%
mutate(var = str_extract(var, "\\d+") %>% as.numeric()) %>%
group_by(id) %>%
mutate(pair = ceiling(var / 2)) %>%
group_by(id, pair) %>%
summarise(combined = paste(value, collapse = "")) %>%
mutate(var_combo = paste0("var", pair)) %>%
select(-pair) %>%
spread(key = var_combo, value = combined) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: id [6]
#> id var1 var2 var3
#> <int> <chr> <chr> <chr>
#> 1 1 AT CC AA
#> 2 2 AT CC AA
#> 3 3 AT CC AA
#> 4 4 AT CC AA
#> 5 5 AT CC AA
#> 6 6 AT CC AA
Created on 2018-11-07 by the reprex package (v0.2.1)
Answer 3 (score: 2)
Using base R you can do:
a <- seq(2,ncol(df),2)
b <- paste0(unlist(df[a]),unlist(df[a+1]))
d <- data.frame(matrix(b,nrow(df)))
result <- cbind(df[1],d)
This can also be written in one line:
(dat = data.frame(matrix(paste0(unlist(df[a<-seq(2,ncol(df),2)]),unlist(df[a+1])),nrow(df))))
X1 X2 X3
1 AT CC AA
2 AT CC AA
3 AT CC AA
4 AT CC AA
5 AT CC AA
6 AT CC AA
7 AT CC AA
8 AT CC AA
9 AT CC AA
10 AT CC AA
11 AA CG AG
12 AA CG AG
13 AA CG AG
14 AA CG AG
15 AA CG AG
16 AA CG AG
17 AA CG AG
18 AA CG AG
19 AA CG AG
20 AA CG AG
Then bind it to the id column:
cbind(df[1],dat)
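This works because unlist() flattens the selected columns one after another and matrix() fills column by column, so each pasted pair lands back in its own column. A small sketch on toy data (the object names are illustrative) that also assigns column names, which the matrix step otherwise leaves as X1, X2, ...:

```r
toy <- data.frame(id = 1:2,
                  var1 = c("A", "A"), var2 = c("T", "A"),
                  var3 = c("C", "C"), var4 = c("C", "G"),
                  stringsAsFactors = FALSE)

a <- seq(2, ncol(toy), 2)                       # first column of each pair: 2, 4
left  <- unlist(toy[a], use.names = FALSE)      # var1 values, then var3 values
right <- unlist(toy[a + 1], use.names = FALSE)  # var2 values, then var4 values

# matrix() fills by column, so rows and pairs line up again
m <- matrix(paste0(left, right), nrow = nrow(toy))
colnames(m) <- paste0("var", seq_along(a))

result <- cbind(toy[1], as.data.frame(m, stringsAsFactors = FALSE))
print(result)
```

Since everything is pasted in a single vectorised call, this avoids a per-pair loop, at the cost of holding the flattened vectors in memory at once.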
Answer 4 (score: 0)
df <- data.frame(id = seq(1,20),
var1 = rep("A", 20),
var2 = c(rep("T", 10), rep("A", 10)),
var3 = rep("C", 20),
var4 = c(rep("C", 10), rep("G", 10)),
var5 = rep("A", 20),
var6 = c(rep("A", 10), rep("G", 10)),
stringsAsFactors = FALSE)
df2 <- data.frame(id = df[,1], var1 = paste(df[,2], df[,3], sep = ""),
var2 = paste(df[,4], df[,5], sep = ""),
var3 = paste(df[,6], df[,7], sep = ""))