Errors in my code and improving the function: split by query

Time: 2019-02-14 17:25:06

Tags: r dplyr tidyverse

I have this sample data frame

nucleotide  start  end    strand  block_id  query  pid       
AE002161.1  5537   6724   1       1         0      AAF73616.1
AE002161.1  6714   7727   1       1         0      AAF37902.1
AE002161.1  7687   10839  -1      1         1      AAF37903.1
AE002161.1  10826  13900  -1      1         0      AAF37904.1
AE002161.1  13887  15596  1       1         0      AAF37905.1
AE002161.1  18606  19487  -1      2         0      AAF37910.1
AE002161.1  19822  19998  -1      2         0      AAF37911.1
AE002161.1  19982  21625  1       2         1      AAF37912.1
AE002161.1  21728  22996  1       2         0      AAF37913.1
AE002161.1  23108  25063  1       2         0      AAF37914.1
AE002161.1  36276  36575  -1      3         0      AAF37924.1
AE002161.1  36680  38116  -1      3         0      AAF37925.1
AE002161.1  38120  39928  -1      3         1      AAF37926.1
AE002161.1  40478  41497  1       3         0      AAF37927.1
AE002161.1  41864  42256  1       3         0      AAF37928.1
AE002161.1  45880  46554  1       4         0      AAF37933.1
AE002161.1  46556  47884  1       4         0      AAF37934.1
AE002161.1  47902  48408  1       4         1      AAF37935.1
AE002161.1  48412  49254  1       4         1      AAF37936.1
AE002161.1  49264  50379  1       4         0      AAF73618.1
AE002161.1  50395  51903  1       4         0      AAF73619.1

and this function

library(tidyverse)

splitq <- function(data){
  a <- data %>%
    mutate(., block_id = group_indices(., nucleotide, block_id) ) %>%
    group_by(nucleotide, block_id) %>%
    mutate(old=cumsum(query)) %>%
    mutate( query = ifelse( old > 1 , 0,  query ) ) %>%
    ungroup()

  a_max <- max(a$block_id)

  b <- data %>%
    arrange( desc(row_number() ) ) %>%
    mutate(., block_id = group_indices(., nucleotide, block_id) + a_max ) %>%
    group_by(nucleotide, block_id) %>%
    mutate(old=cumsum(query)) %>%
    mutate( query = ifelse( old > 1 , 0,  query ) ) %>%
    ungroup() %>%
    bind_rows(a) %>%
    select(-old)
}

When I run this function, I get this result

nucleotide  start  end    strand  block_id  query  pid         type 
AE002161.1  50395  51903  1       8         0      AAF73619.1  CDS  
AE002161.1  49264  50379  1       8         0      AAF73618.1  CDS  
AE002161.1  48412  49254  1       8         1      AAF37936.1  CDS  
AE002161.1  47902  48408  1       8         0      AAF37935.1  CDS  
AE002161.1  46556  47884  1       8         0      AAF37934.1  CDS  
AE002161.1  45880  46554  1       8         0      AAF37933.1  CDS  
AE002161.1  41864  42256  1       7         0      AAF37928.1  CDS  
AE002161.1  40478  41497  1       7         0      AAF37927.1  CDS  
AE002161.1  38120  39928  -1      7         1      AAF37926.1  CDS  
AE002161.1  36680  38116  -1      7         0      AAF37925.1  CDS  
AE002161.1  36276  36575  -1      7         0      AAF37924.1  CDS  
AE002161.1  23108  25063  1       6         0      AAF37914.1  CDS  
AE002161.1  21728  22996  1       6         0      AAF37913.1  CDS  
AE002161.1  19982  21625  1       6         1      AAF37912.1  CDS  
AE002161.1  19822  19998  -1      6         0      AAF37911.1  CDS  
AE002161.1  18606  19487  -1      6         0      AAF37910.1  CDS  
AE002161.1  13887  15596  1       5         0      AAF37905.1  CDS  
AE002161.1  10826  13900  -1      5         0      AAF37904.1  CDS  
AE002161.1  7687   10839  -1      5         1      AAF37903.1  CDS  
AE002161.1  6714   7727   1       5         0      AAF37902.1  CDS  
AE002161.1  5537   6724   1       5         0      AAF73616.1  CDS  
AE002161.1  5537   6724   1       1         0      AAF73616.1  CDS  
AE002161.1  6714   7727   1       1         0      AAF37902.1  CDS  
AE002161.1  7687   10839  -1      1         1      AAF37903.1  CDS  
AE002161.1  10826  13900  -1      1         0      AAF37904.1  CDS  
AE002161.1  13887  15596  1       1         0      AAF37905.1  CDS  
AE002161.1  18606  19487  -1      2         0      AAF37910.1  CDS  
AE002161.1  19822  19998  -1      2         0      AAF37911.1  CDS  
AE002161.1  19982  21625  1       2         1      AAF37912.1  CDS  
AE002161.1  21728  22996  1       2         0      AAF37913.1  CDS  
AE002161.1  23108  25063  1       2         0      AAF37914.1  CDS  
AE002161.1  36276  36575  -1      3         0      AAF37924.1  CDS  
AE002161.1  36680  38116  -1      3         0      AAF37925.1  CDS  
AE002161.1  38120  39928  -1      3         1      AAF37926.1  CDS  
AE002161.1  40478  41497  1       3         0      AAF37927.1  CDS  
AE002161.1  41864  42256  1       3         0      AAF37928.1  CDS  
AE002161.1  45880  46554  1       4         0      AAF37933.1  CDS  
AE002161.1  46556  47884  1       4         0      AAF37934.1  CDS  
AE002161.1  47902  48408  1       4         1      AAF37935.1  CDS  
AE002161.1  48412  49254  1       4         0      AAF37936.1  CDS  
AE002161.1  49264  50379  1       4         0      AAF73618.1  CDS  
AE002161.1  50395  51903  1       4         0      AAF73619.1  CDS  

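Edit: this does not look right, because it produces redundant rows: it should create 5 blocks, not 8.

I only want to split on query == 1. So for each query I should get the n rows above it and the n rows below it (the same rows, in the same order). The operation should be done per block_id.

When two rows with query == 1 are adjacent, as in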

AE002161.1  45880  46554  1       4         0      AAF37933.1
AE002161.1  46556  47884  1       4         0      AAF37934.1
AE002161.1  47902  48408  1       4         1      AAF37935.1
AE002161.1  48412  49254  1       4         1      AAF37936.1
AE002161.1  49264  50379  1       4         0      AAF73618.1
AE002161.1  50395  51903  1       4         0      AAF73619.1

it should return

AE002161.1  45880  46554  1       4         0      AAF37933.1
AE002161.1  46556  47884  1       4         0      AAF37934.1
AE002161.1  47902  48408  1       4         1      AAF37935.1
AE002161.1  48412  49254  1       4         0      AAF37936.1
AE002161.1  49264  50379  1       4         0      AAF73618.1
AE002161.1  50395  51903  1       4         0      AAF73619.1
AE002161.1  45880  46554  1       5         0      AAF37933.1
AE002161.1  46556  47884  1       5         0      AAF37934.1
AE002161.1  47902  48408  1       5         0      AAF37935.1
AE002161.1  48412  49254  1       5         1      AAF37936.1
AE002161.1  49264  50379  1       5         0      AAF73618.1
AE002161.1  50395  51903  1       5         0      AAF73619.1

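In other words, I don't mind if every block_id changes, as long as each block's id is unique (not repeated anywhere in the output).

Also, this example has only one nucleotide, but the real data may contain several.

However, when I run this on a 591 MB file with 2,070,926 rows, of which 332,236 have query == 1 and 330,409 of those are distinct, something goes wrong.

No error message is produced, but some queries are missing from the result.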
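
One way to see which queries go missing is to compare the pid values flagged as queries in the input with those flagged in the output. The sketch below uses two small made-up data frames. `input` and `result` are hypothetical names, not objects from the real file, and `result` merely stands in for the output of some splitting step.

```r
library(dplyr)

# Made-up input: three query rows (p2, p3, p4) in one block
input <- tibble(
  block_id = 1L,
  query    = c(0, 1, 1, 1, 0),
  pid      = c("p1", "p2", "p3", "p4", "p5")
)

# Made-up output of a splitting step that only flagged p2 and p4 as queries
result <- tibble(
  block_id = rep(1:2, each = 5),
  query    = c(0, 1, 0, 0, 0,   0, 0, 0, 1, 0),
  pid      = rep(c("p1", "p2", "p3", "p4", "p5"), times = 2)
)

# Query pids present in the input but never flagged as a query in the output
missing_queries <- input %>%
  filter(query == 1) %>%
  anti_join(filter(result, query == 1), by = "pid") %>%
  pull(pid)

missing_queries
```

On real data the same comparison would be made between the original data frame and whatever the splitting function returns.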

Does anyone know what is going on?

Thanks in advance

1 answer:

Answer 0 (score: 0)
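
Reading the function in the question, one likely reason for both symptoms is that the forward pass keeps only the first query in each block and the reversed pass keeps only the last. A block with a single query is therefore copied twice (hence 8 blocks instead of 5), and a block with three or more queries loses the ones in the middle. The sketch below reproduces this on a made-up block. `toy` and `keep_first_query` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Made-up block with three query rows (p2, p3, p4)
toy <- tibble(
  nucleotide = "X",
  block_id   = 1L,
  query      = c(0, 1, 1, 1, 0),
  pid        = c("p1", "p2", "p3", "p4", "p5")
)

# Same rule as in the question: zero out every query after the first one
keep_first_query <- function(data) {
  data %>%
    group_by(nucleotide, block_id) %>%
    mutate(query = ifelse(cumsum(query) > 1, 0, query)) %>%
    ungroup()
}

forward  <- keep_first_query(toy)                               # top to bottom
backward <- keep_first_query(arrange(toy, desc(row_number())))  # bottom to top

# Queries that survive in at least one of the two passes
kept <- unique(c(forward$pid[forward$query == 1],
                 backward$pid[backward$query == 1]))
# Queries that survive in neither pass
lost <- base::setdiff(toy$pid[toy$query == 1], kept)

kept  # "p2" "p4"
lost  # "p3"
```

If the real file contains blocks with three or more queries, this would account for queries disappearing without any error message.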

Fixed function

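library(dplyr)

splitq <- function(data){
  # One row per query: its pid, its original block (old) and a new unique block id (new)
  a <- data %>% filter(query == 1) %>% mutate( old = block_id, new = row_number()) %>% select(pid, new, old)
  # Repeat each block once per query it contains. The join uses block_id only,
  # so this assumes block_id values are not reused across nucleotides.
  # Blocks without any query row end up with NA in block_id and query.
  b <- data %>% 
    left_join(a, by = c("block_id" = "old")) %>%
    # In each copy, only the row whose pid matches that copy's query keeps query = 1
    mutate( query = ifelse( pid.x == pid.y, 1, 0), block_id = new ) %>%
    arrange(nucleotide, block_id, start, end) %>%
    select(-pid.y, -new) %>%
    rename(pid=pid.x)
  b  # return the result visibly (a bare assignment is returned invisibly)
}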
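
As a usage sketch, the adjacent-query case from the question (block 4) can be run through a function of the same shape. `split_by_query` below is a hypothetical variant, not the exact code above: it also joins on nucleotide, on the assumption that block_id values may repeat across nucleotides, and it uses an inner join so that blocks without a query are dropped.

```r
library(dplyr)

# Block 4 from the question: two adjacent rows with query == 1
block4 <- tibble(
  nucleotide = "AE002161.1",
  start    = c(45880, 46556, 47902, 48412, 49264, 50395),
  end      = c(46554, 47884, 48408, 49254, 50379, 51903),
  strand   = 1,
  block_id = 4,
  query    = c(0, 0, 1, 1, 0, 0),
  pid      = c("AAF37933.1", "AAF37934.1", "AAF37935.1",
               "AAF37936.1", "AAF73618.1", "AAF73619.1")
)

split_by_query <- function(data) {
  # One row per query, each with a new unique block id
  queries <- data %>%
    filter(query == 1) %>%
    mutate(new_block = row_number()) %>%
    select(nucleotide, block_id, query_pid = pid, new_block)

  data %>%
    # Every block row is repeated once per query found in that block
    inner_join(queries, by = c("nucleotide", "block_id")) %>%
    # Only the row that is the copy's own query keeps query = 1
    mutate(query = as.integer(pid == query_pid), block_id = new_block) %>%
    arrange(nucleotide, block_id, start, end) %>%
    select(-query_pid, -new_block)
}

out <- split_by_query(block4)
out
```

For this input the result has 12 rows in two blocks, each with exactly one query row, which matches the shape requested in the question (the block ids are renumbered, which the question allows). Recent dplyr versions may warn about a many-to-many relationship in the join. That is expected here, because each block row is deliberately repeated once per query.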