I have this example data frame:
nucleotide start end strand block_id query pid
AE002161.1 5537 6724 1 1 0 AAF73616.1
AE002161.1 6714 7727 1 1 0 AAF37902.1
AE002161.1 7687 10839 -1 1 1 AAF37903.1
AE002161.1 10826 13900 -1 1 0 AAF37904.1
AE002161.1 13887 15596 1 1 0 AAF37905.1
AE002161.1 18606 19487 -1 2 0 AAF37910.1
AE002161.1 19822 19998 -1 2 0 AAF37911.1
AE002161.1 19982 21625 1 2 1 AAF37912.1
AE002161.1 21728 22996 1 2 0 AAF37913.1
AE002161.1 23108 25063 1 2 0 AAF37914.1
AE002161.1 36276 36575 -1 3 0 AAF37924.1
AE002161.1 36680 38116 -1 3 0 AAF37925.1
AE002161.1 38120 39928 -1 3 1 AAF37926.1
AE002161.1 40478 41497 1 3 0 AAF37927.1
AE002161.1 41864 42256 1 3 0 AAF37928.1
AE002161.1 45880 46554 1 4 0 AAF37933.1
AE002161.1 46556 47884 1 4 0 AAF37934.1
AE002161.1 47902 48408 1 4 1 AAF37935.1
AE002161.1 48412 49254 1 4 1 AAF37936.1
AE002161.1 49264 50379 1 4 0 AAF73618.1
AE002161.1 50395 51903 1 4 0 AAF73619.1
and this function:
library(tidyverse)
splitq <- function(data){
  a <- data %>%
    mutate(., block_id = group_indices(., nucleotide, block_id) ) %>%
    group_by(nucleotide, block_id) %>%
    mutate(old=cumsum(query)) %>%
    mutate( query = ifelse( old > 1 , 0, query ) ) %>%
    ungroup()
  a_max <- max(a$block_id)
  b <- data %>%
    arrange( desc(row_number() ) ) %>%
    mutate(., block_id = group_indices(., nucleotide, block_id) + a_max ) %>%
    group_by(nucleotide, block_id) %>%
    mutate(old=cumsum(query)) %>%
    mutate( query = ifelse( old > 1 , 0, query ) ) %>%
    ungroup() %>%
    bind_rows(a) %>%
    select(-old)
}
When I run this function, I get this result:
nucleotide start end strand block_id query pid type
AE002161.1 50395 51903 1 8 0 AAF73619.1 CDS
AE002161.1 49264 50379 1 8 0 AAF73618.1 CDS
AE002161.1 48412 49254 1 8 1 AAF37936.1 CDS
AE002161.1 47902 48408 1 8 0 AAF37935.1 CDS
AE002161.1 46556 47884 1 8 0 AAF37934.1 CDS
AE002161.1 45880 46554 1 8 0 AAF37933.1 CDS
AE002161.1 41864 42256 1 7 0 AAF37928.1 CDS
AE002161.1 40478 41497 1 7 0 AAF37927.1 CDS
AE002161.1 38120 39928 -1 7 1 AAF37926.1 CDS
AE002161.1 36680 38116 -1 7 0 AAF37925.1 CDS
AE002161.1 36276 36575 -1 7 0 AAF37924.1 CDS
AE002161.1 23108 25063 1 6 0 AAF37914.1 CDS
AE002161.1 21728 22996 1 6 0 AAF37913.1 CDS
AE002161.1 19982 21625 1 6 1 AAF37912.1 CDS
AE002161.1 19822 19998 -1 6 0 AAF37911.1 CDS
AE002161.1 18606 19487 -1 6 0 AAF37910.1 CDS
AE002161.1 13887 15596 1 5 0 AAF37905.1 CDS
AE002161.1 10826 13900 -1 5 0 AAF37904.1 CDS
AE002161.1 7687 10839 -1 5 1 AAF37903.1 CDS
AE002161.1 6714 7727 1 5 0 AAF37902.1 CDS
AE002161.1 5537 6724 1 5 0 AAF73616.1 CDS
AE002161.1 5537 6724 1 1 0 AAF73616.1 CDS
AE002161.1 6714 7727 1 1 0 AAF37902.1 CDS
AE002161.1 7687 10839 -1 1 1 AAF37903.1 CDS
AE002161.1 10826 13900 -1 1 0 AAF37904.1 CDS
AE002161.1 13887 15596 1 1 0 AAF37905.1 CDS
AE002161.1 18606 19487 -1 2 0 AAF37910.1 CDS
AE002161.1 19822 19998 -1 2 0 AAF37911.1 CDS
AE002161.1 19982 21625 1 2 1 AAF37912.1 CDS
AE002161.1 21728 22996 1 2 0 AAF37913.1 CDS
AE002161.1 23108 25063 1 2 0 AAF37914.1 CDS
AE002161.1 36276 36575 -1 3 0 AAF37924.1 CDS
AE002161.1 36680 38116 -1 3 0 AAF37925.1 CDS
AE002161.1 38120 39928 -1 3 1 AAF37926.1 CDS
AE002161.1 40478 41497 1 3 0 AAF37927.1 CDS
AE002161.1 41864 42256 1 3 0 AAF37928.1 CDS
AE002161.1 45880 46554 1 4 0 AAF37933.1 CDS
AE002161.1 46556 47884 1 4 0 AAF37934.1 CDS
AE002161.1 47902 48408 1 4 1 AAF37935.1 CDS
AE002161.1 48412 49254 1 4 0 AAF37936.1 CDS
AE002161.1 49264 50379 1 4 0 AAF73618.1 CDS
AE002161.1 50395 51903 1 4 0 AAF73619.1 CDS
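Edit: this does not look right, because it produces redundancy: it should create 5 blocks, not 8.
I only want to split on query == 1. For each query, I should get the same n rows above it and the same n rows below it (same rows, same order). The operation should be done per block_id.
When two query == 1 rows are next to each other: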
AE002161.1 45880 46554 1 4 0 AAF37933.1
AE002161.1 46556 47884 1 4 0 AAF37934.1
AE002161.1 47902 48408 1 4 1 AAF37935.1
AE002161.1 48412 49254 1 4 1 AAF37936.1
AE002161.1 49264 50379 1 4 0 AAF73618.1
AE002161.1 50395 51903 1 4 0 AAF73619.1
it should return:
AE002161.1 45880 46554 1 4 0 AAF37933.1
AE002161.1 46556 47884 1 4 0 AAF37934.1
AE002161.1 47902 48408 1 4 1 AAF37935.1
AE002161.1 48412 49254 1 4 0 AAF37936.1
AE002161.1 49264 50379 1 4 0 AAF73618.1
AE002161.1 50395 51903 1 4 0 AAF73619.1
AE002161.1 45880 46554 1 5 0 AAF37933.1
AE002161.1 46556 47884 1 5 0 AAF37934.1
AE002161.1 47902 48408 1 5 0 AAF37935.1
AE002161.1 48412 49254 1 5 1 AAF37936.1
AE002161.1 49264 50379 1 5 0 AAF73618.1
AE002161.1 50395 51903 1 5 0 AAF73619.1
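This means I don't care if every block_id changes, as long as each one is unique per block (not repeated anywhere in the output).
Also, this example contains a single nucleotide, but the real data may contain several different ones.
However, when I run this on a 591 MB file with 2070926 rows, of which 332236 have query == 1 (330409 of them distinct), something goes wrong.
No error message is produced, but some queries are missing from the output.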
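One way to pin down which queries go missing is to compare the query pids of the input with those of the output. The following is only a minimal sketch: `missing_queries` is a hypothetical helper name, the toy data is made up, and it assumes pid is enough to identify a query (in the real file some query pids are repeated, so this would only list pids that are lost entirely).

```r
library(dplyr)

# Hypothetical helper: query rows of `input` whose pid is not flagged
# query == 1 anywhere in `output`.
missing_queries <- function(input, output) {
  input %>%
    filter(query == 1) %>%
    anti_join(filter(output, query == 1), by = "pid")
}

# Made-up toy data: one block containing three queries.
input <- data.frame(
  nucleotide = "N1",
  block_id   = 1,
  query      = c(0, 1, 1, 1, 0),
  pid        = c("p1", "p2", "p3", "p4", "p5"),
  stringsAsFactors = FALSE
)

# Made-up output in which the middle query has lost its flag.
output <- input %>% mutate(query = ifelse(pid == "p3", 0, query))

missing_queries(input, output)$pid
# [1] "p3"
```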
Does anyone know what is going on?
Thanks in advance.
Answer 0 (score: 0)
Fixed function:
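splitq <- function(data){
  # One row per query: its pid, its original block (old) and a new, unique block id (new)
  a <- data %>%
    filter(query == 1) %>%
    mutate( old = block_id, new = row_number()) %>%
    select(pid, new, old)
  # Copy every original block once per query it contains;
  # in each copy only the row whose pid matches that query keeps query == 1
  b <- data %>%
    left_join(a, by = c("block_id" = "old")) %>%
    group_by(new) %>%
    mutate( query = ifelse( pid.x == pid.y, 1, 0), block_id = new ) %>%
    arrange(nucleotide, block_id, start, end) %>%
    select(-pid.y, -new) %>%
    rename(pid=pid.x) %>%
    ungroup()
  b  # return the result visibly
}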
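Reading the original two-pass function, the likely reason queries disappear is that each pass keeps only one query per block (the first one in the forward pass, the last one in the reversed pass), so a block with three or more queries loses the ones in between. The join-based version makes one copy of the block per query instead. As a quick check, here it is run on block 4 from the question, where two queries are adjacent. This is a sketch under a few assumptions: `block4` is a name chosen here, the function is restated so the block runs on its own, and block_id is assumed to be unique across nucleotides because the join uses block_id alone. Recent dplyr versions may warn about a many-to-many join, which is expected for this approach.

```r
library(dplyr)

# Restated from the answer so this block is self-contained.
splitq <- function(data) {
  a <- data %>%
    filter(query == 1) %>%
    mutate(old = block_id, new = row_number()) %>%
    select(pid, new, old)
  data %>%
    left_join(a, by = c("block_id" = "old")) %>%
    group_by(new) %>%
    mutate(query = ifelse(pid.x == pid.y, 1, 0), block_id = new) %>%
    arrange(nucleotide, block_id, start, end) %>%
    select(-pid.y, -new) %>%
    rename(pid = pid.x) %>%
    ungroup()
}

# Block 4 from the question: two adjacent query rows.
block4 <- data.frame(
  nucleotide = "AE002161.1",
  start    = c(45880, 46556, 47902, 48412, 49264, 50395),
  end      = c(46554, 47884, 48408, 49254, 50379, 51903),
  strand   = 1,
  block_id = 4,
  query    = c(0, 0, 1, 1, 0, 0),
  pid      = c("AAF37933.1", "AAF37934.1", "AAF37935.1",
               "AAF37936.1", "AAF73618.1", "AAF73619.1"),
  stringsAsFactors = FALSE
)

res <- splitq(block4)
res$block_id
# [1] 1 1 1 1 1 1 2 2 2 2 2 2
res$query
# [1] 0 0 1 0 0 0 0 0 0 1 0 0
```

The new block ids restart at 1 rather than continuing from 4, which fits the requirement that ids only need to be unique per block. If block_id values can repeat across nucleotides, the join would presumably need to match on nucleotide as well.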