Question

I am working with biological sequence data in GTF format. Here is a simplified example of the format:

start   stop   type         name 
1       90     exon         transcript_1_exon_1
12      15     start_codon  transcript_1_exon_1
100     160    exon         transcript_1_exon_2
190     250    exon         transcript_1_exon_3
217     220    stop_codon   transcript_1_exon_3

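I am trying to translate the exons into protein sequences. However, some parts of the exons are not protein-coding. This is indicated by the presence of a row whose type field is set to start_codon or stop_codon.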

I would like to move the start and stop positions of these features into their own columns, like so:

start   stop  type         name                 start_codon  stop_codon
1       90    exon         transcript_1_exon_1  12           NA
100     160   exon         transcript_1_exon_2  NA           NA
190     250   exon         transcript_1_exon_3  NA           220

However, I can't figure out how to do this in R. The closest I have come using dplyr is:

gtf3 <- gtf2 %>% group_by(feature_name) %>% summarise(
  start_codon = ifelse(sum(type == "start_codon") != 0, start[type == "start_codon"], NA),
  stop_codon = ifelse(sum(type == "stop_codon") != 0, stop[type == "stop_codon"], NA))

But this gives me the following error: Evaluation error: object of type 'closure' is not subsettable.

How can I move the start position of the start codon and the stop position of the stop codon into their own columns?

Answer 1

Here is one way to do it:

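library(dplyr)

# Keep the exon rows, then attach each codon position by exon name
df1 %>% filter(type == "exon") %>%
  # start codons: keep name and start; the suffix renames start to start_codon
  left_join(df1 %>%
              filter(type == "start_codon") %>%
              select(-type, -stop), by = "name", suffix = c("", "_codon")) %>%
  # stop codons: keep name and stop; the suffix renames stop to stop_codon
  left_join(df1 %>%
              filter(type == "stop_codon") %>%
              select(-type, -start), by = "name", suffix = c("", "_codon"))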

#   start stop type                name start_codon stop_codon
# 1     1   90 exon transcript_1_exon_1          12         NA
# 2   100  160 exon transcript_1_exon_2          NA         NA
# 3   190  250 exon transcript_1_exon_3          NA        220
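
For comparison, the grouped summarise attempted in the question can probably be made to work as well. The following is only a sketch under some assumptions: the data frame has the columns start, stop, type and name exactly as in the example, and exon_start / exon_stop are illustrative names that do not appear in the question. One possible cause of the 'closure' error (a guess, since gtf2 is not shown) is that start or stop did not resolve to a column, in which case R falls back to its built-in functions of the same names, and those cannot be subset.

```r
library(dplyr)

# Example data rebuilt from the table in the question
df1 <- data.frame(
  start = c(1, 12, 100, 190, 217),
  stop  = c(90, 15, 160, 250, 220),
  type  = c("exon", "start_codon", "exon", "exon", "stop_codon"),
  name  = c("transcript_1_exon_1", "transcript_1_exon_1",
            "transcript_1_exon_2", "transcript_1_exon_3",
            "transcript_1_exon_3"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  group_by(name) %>%
  summarise(
    # Exon coordinates; new names avoid masking the original columns
    exon_start = start[type == "exon"][1],
    exon_stop  = stop[type == "exon"][1],
    # Take the first matching codon row if one exists, otherwise NA.
    # A plain if/else is used instead of ifelse() because the test is a
    # single TRUE/FALSE per group; NA_real_ keeps the column numeric.
    start_codon = if (any(type == "start_codon")) start[type == "start_codon"][1] else NA_real_,
    stop_codon  = if (any(type == "stop_codon")) stop[type == "stop_codon"][1] else NA_real_
  )

print(result)
```

Unlike the join version, this returns one row per name even if an exon row is missing (exon_start and exon_stop would then be NA), and it silently keeps only the first codon row if a group contains more than one.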
