Extracting a pipe-separated list of dbids in R

Time: 2016-03-03 16:42:25

Tags: r

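Complete newbie to R. Somehow I have managed to get this far in my data manipulation, so please don't shoot me down!

For each activity in 'lookup', I want to extract a pipe-separated list of the database IDs of the matching rows in 'cat' and paste it into the output data frame 'out'.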

lookup<-read.csv("lookup.csv",header=TRUE,stringsAsFactors=FALSE, fill=TRUE)

'lookup.csv' looks like this

activity                verb

gap junction channel    connects  
recombinase             binds  
activator               activates  
binding                 binds
kinase adaptor          binds  
DNA clamp loader        binds
branching               branches  
carboxylase             carboxylates  
nuclease                cleaves  
peptidase               cleaves  
aldolase                cleaves  
heparanase              cleaves  
radical SAM enzyme      cleaves  
endopeptidase           cleaves  
dihydroorotase          cleaves  
N-glycosylase           cleaves  
glycosylase             cleaves  
symporter               co-transports  
cyclase                 converts  
transhydrogenase        converts  
deacetylase             deacetylates  
decarboxylase           decarboxylates  
catalase                decomposes  
dehydratase             dehydrates  

I read my target file into 'cat'; it also contains rows with an empty activity, which I fill with NA

cat<-read.csv("molecules3.csv",header=TRUE,
     stringsAsFactors=FALSE, fill=TRUE, na.strings = c(""," ","NA"))

'molecules3.csv' looks like this

dbid    function

6787677 racemase and epimerase activity, acting on carbohydrates and derivatives
6787642 GDP-L-fucose synthase activity
6787632 GDP-mannose 4,6-dehydratase activity
6787623 isomerase activity
6787594 tRNA (adenine-N1-)-methyltransferase activity
6787591 tRNA (guanine-N1-)-methyltransferase activity
6787567 tRNA dimethylallyltransferase activity
6787566 pseudouridine synthase activity
6787540 fucokinase activity
6787533 fucose-1-phosphate guanylyltransferase activity
6787525 tRNA (adenine-N1-)-methyltransferase activity
6787447 tRNA-5-taurinomethyluridine 2-sulfurtransferase
6787403 transferase activity|transferase activity
6787329 phosphopentomutase activity
6787321 deoxyribose-phosphate aldolase activity
6786881 RNA polymerase activity
6786854 tRNA-specific ribonuclease activity|ribonuclease P activity

I store the activities and reaction types in variables

#Activities are in column 1, reaction type is in column 2
f<- unique(factor(lookup[,1])) 
v<- factor(lookup[,2])

I build a data frame to hold the results of the 'for' loop

out<-data.frame("Activity"=character(0), "numberActivities"=numeric(0),     "rows"=numeric(0), "DBIDs"=numeric(0), "ReactionType"=character(0),     stringsAsFactors = FALSE)


# loop over each row and determine number of activities for each unique activity
for (i in 1:length(f)){     #for each unique activity
  out[i,1]<-as.character(f[i])            #store activity
  out[i,2]<-length(grep(f[i],cat[,2]))    # store number of rows matching the activity
  out[i,3]<-paste(grep(f[i],cat[,2]),collapse="|") #store positions of the matching rows

  out[i,5]<-paste(v[i])               #paste in the corresponding reaction type for each activity
}

'out' looks like this

    Activity    numberActivities rows   DBIDs   ReactionType

101 kinase      1164    6|12|23|24|31…  NA      phosphorylates
114 transferase 892     1|46|48|55|56…  NA      transfers
11  peptidase   483     35|38|51|81|85… NA      cleaves

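And so on.

To fill [i,4], I want a list of the 'dbid's for each activity type, separated by '|'.

out[i,3] does exactly this, but it is filled with row numbers instead.

How can I populate column 4 with the list of dbids?

Can anyone help?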

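1 answer:

Answer 0: (score: 0)

Don't use cat as an object name; it is the name of a base R function. Here is an (untested) implementation of the l/sapply( split(.,.) , FUN) paradigm, which handles problems where you want to do calculations within segments of a data frame that share a particular feature (the activity, in this case):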

 temp <- sapply( split( categ, categ$Activities) , 
                 function( db) c(Activities=db$Activities[1], # use the first row as a label
                                 numberActivities= nrow(db), # get the count
                                  rows=paste0(db$Activities,collapse="|") )
         # Have concerns about the length of these results but it's what you asked for
                 )
final<- merge( temp, lookup)

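A better answer could be given if an actual sample that supports testing were provided. What you have posted is not yet enough for that, so this code is untested.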
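The following is a minimal sketch of one way to get the dbid list, not a tested solution for the full files. It uses a few rows copied from the samples in the question as stand-in data. The names `molecules`, `func`, and `collect_dbids` are hypothetical, introduced only for this illustration, and `fixed = TRUE` is an assumption that activities should be matched as literal substrings rather than as regular expressions. The idea is to reuse the row positions returned by `grep` as an index into the dbid column.

```r
# Toy stand-ins for lookup.csv and molecules3.csv (rows copied from the question).
lookup <- data.frame(
  activity = c("dehydratase", "aldolase", "transferase"),
  verb     = c("dehydrates", "cleaves", "transfers"),
  stringsAsFactors = FALSE
)

# 'molecules' replaces the object called 'cat'; 'func' stands for the 'function' column.
molecules <- data.frame(
  dbid = c(6787632, 6787594, 6787403, 6787321, 6786881),
  func = c("GDP-mannose 4,6-dehydratase activity",
           "tRNA (adenine-N1-)-methyltransferase activity",
           "transferase activity|transferase activity",
           "deoxyribose-phosphate aldolase activity",
           "RNA polymerase activity"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one output row for a single activity.
collect_dbids <- function(activity, molecules) {
  # Row positions whose description contains the activity as a literal substring
  hits <- grep(activity, molecules$func, fixed = TRUE)
  data.frame(
    Activity         = activity,
    numberActivities = length(hits),
    rows             = paste(hits, collapse = "|"),
    # Index the dbid column with the same row positions
    DBIDs            = paste(molecules$dbid[hits], collapse = "|"),
    stringsAsFactors = FALSE
  )
}

# One row per unique activity, stacked into a single data frame
out <- do.call(rbind, lapply(unique(lookup$activity), collect_dbids,
                             molecules = molecules))

# Look up the reaction type by activity name rather than by position
out$ReactionType <- lookup$verb[match(out$Activity, lookup$activity)]

print(out)
```

Because matching is by substring, "transferase" also picks up the "methyltransferase" row, which is the same behaviour as the `grep` calls in the question. Inside the original loop, the equivalent line would presumably be `out[i,4] <- paste(cat[grep(f[i], cat[,2]), 1], collapse="|")`, assuming dbid is the first column of the target data frame.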