Question

I have a file named 'genes' with 4300 rows that looks like this:

Gene_id  chr  start   stop              
GeneA chr1  10  1000                 
GeneB chr1  2300  7000                     
GeneC chr1 10000 13577

and another file named 'bases' (about 100,000 rows) that looks like this:

Chr Bases          
chr1 160           
chr1 157             
chr1 8500           
chr1 2200

I want to generate a file that keeps only the bases that fall within the range between start and stop of each gene.

So the output would look like this:

Chr Bases             
chr1 160            
chr1 157

I tried this function, but it only gave me the first entry four times:

methC <- apply(bases,1,function(a){
my_bases <- bases[bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop,]
result <- my_bases[,2]
return(result)
})

>methC
# 160 160 160

So I am missing base 157, and 160 is repeated 4 times.

If I use

b <- bases[which(bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop),]
> b
 #  Chr Bases
#chr1   160

I am still missing 157, but maybe that is because of the row order.

However, if I try the 'which' function on my real, larger files, I get an empty data.frame

> b
Chr        Base       
<0 rows> (or 0-length row.names)

which is why I thought a function would cope better with large datasets.

Answer 1

I would use the data.table library:

library("data.table")

# Read the data from file
genes <- data.table(read.csv("genes.csv"))
bases <- data.table(read.csv("bases.csv"))

# Define the columns that are used to join the two tables
setkey(genes, chr)
setkey(bases, Chr)

# The first line joins the two tables on chromosome, and the second
# line filters the result to keep only the bases that lie inside a
# gene and defines the columns to be shown in the output. Note the
# need to take care with capitalisation (chr vs Chr) and with the
# column name 'stop', which is also the name of a base R function
genes[bases, allow.cartesian = TRUE][
  Bases >= start & Bases <= `stop`, list(Chr = chr, Bases)]

This produces the following result:

    Chr Bases
1: chr1   160
2: chr1   157
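
A possible variation on the join above, offered as a sketch rather than a tested solution for the real files: data.table also supports non-equi joins, which match on the chromosome and on the start/stop range in a single step instead of building the full per-chromosome cross product and filtering it afterwards. The example below builds the sample data inline so that it runs on its own; the object name `hits` and the extra `Gene_id` column in the output are illustrative choices, not part of the original question. It assumes a data.table version that supports non-equi joins and `nomatch = NULL`.

```r
library(data.table)

# Sample data copied from the question (the real files would be read from disk)
genes <- data.table(
  Gene_id = c("GeneA", "GeneB", "GeneC"),
  chr     = c("chr1", "chr1", "chr1"),
  start   = c(10L, 2300L, 10000L),
  stop    = c(1000L, 7000L, 13577L)
)
bases <- data.table(
  Chr   = c("chr1", "chr1", "chr1", "chr1"),
  Bases = c(160L, 157L, 8500L, 2200L)
)

# Non-equi join: for every row of 'bases', look up the genes on the same
# chromosome whose [start, stop] interval contains the base position.
# nomatch = NULL drops bases that fall inside no gene.
# The i. prefix refers to columns of 'bases' (the table inside the brackets).
hits <- genes[bases,
              on = .(chr = Chr, start <= Bases, stop >= Bases),
              nomatch = NULL,
              .(Chr = i.Chr, Bases = i.Bases, Gene_id)]

print(hits)
```

With the sample data this should keep 160 and 157 (both inside GeneA) and drop 8500 and 2200. If genes can overlap, a base that lies in several genes appears once per gene; dropping `Gene_id` and wrapping the result in `unique()` would collapse those repeats.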
