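Complete newbie to R here. Somehow I've managed to get this far with my data manipulation, so please go easy on me!
I'm trying to extract a pipe-separated list of database IDs from 'cat', which I query using 'lookup', and paste that list into the output, 'out'.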
lookup<-read.csv("lookup.csv",header=TRUE,stringsAsFactors=FALSE, fill=TRUE)
'lookup.csv' looks like this:
activity                verb
gap junction channel    connects
recombinase             binds
activator               activates
binding                 binds
kinase adaptor          binds
DNA clamp loader        binds
branching               branches
carboxylase             carboxylates
nuclease                cleaves
peptidase               cleaves
aldolase                cleaves
heparanase              cleaves
radical SAM enzyme      cleaves
endopeptidase           cleaves
dihydroorotase          cleaves
N-glycosylase           cleaves
glycosylase             cleaves
symporter               co-transports
cyclase                 converts
transhydrogenase        converts
deacetylase             deacetylates
decarboxylase           decarboxylates
catalase                decomposes
dehydratase             dehydrates
I read my target file into 'cat'. It also has rows with an empty activity, which I fill with NA:
cat<-read.csv("molecules3.csv",header=TRUE,
stringsAsFactors=FALSE, fill=TRUE, na.strings = c(""," ","NA"))
'molecules3.csv' looks like this:
dbid function
6787677 racemase and epimerase activity, acting on carbohydrates and derivatives
6787642 GDP-L-fucose synthase activity
6787632 GDP-mannose 4,6-dehydratase activity
6787623 isomerase activity
6787594 tRNA (adenine-N1-)-methyltransferase activity
6787591 tRNA (guanine-N1-)-methyltransferase activity
6787567 tRNA dimethylallyltransferase activity
6787566 pseudouridine synthase activity
6787540 fucokinase activity
6787533 fucose-1-phosphate guanylyltransferase activity
6787525 tRNA (adenine-N1-)-methyltransferase activity
6787447 tRNA-5-taurinomethyluridine 2-sulfurtransferase
6787403 transferase activity|transferase activity
6787329 phosphopentomutase activity
6787321 deoxyribose-phosphate aldolase activity
6786881 RNA polymerase activity
6786854 tRNA-specific ribonuclease activity|ribonuclease P activity
I store the activities and reaction types in variables:
# Activities are in column 1, reaction type is in column 2
f<- unique(factor(lookup[,1]))
v<- factor(lookup[,2])
I build a data frame to hold the results of the 'for' loop:
out<-data.frame("Activity"=character(0), "numberActivities"=numeric(0), "rows"=numeric(0), "DBIDs"=numeric(0), "ReactionType"=character(0), stringsAsFactors = FALSE)
# loop over each row and determine number of activities for each unique activity
for (i in 1:length(f)){ #for each unique activity
out[i,1]<-as.character(f[i]) #store activity
out[i,2]<-length(grep(f[i],cat[,2])) # store length of matched activity
out[i,3]<-paste(grep(f[i],cat[,2]),collapse="|") #store position of row matches
out[i,5]<-paste(v[i]) #paste in the corresponding reaction type for each activity
}
'out' looks like this:
Activity numberActivities rows DBIDs ReactionType
101 kinase 1164 6|12|23|24|31… NA phosphorylates
114 transferase 892 1|46|48|55|56… NA transfers
11 peptidase 483 35|38|51|81|85… NA cleaves
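And so on.
To fill in out[i,4], I want a list of the 'dbid' values for each activity type, separated by '|'.
out[i,3] is already in that form, but it holds row numbers instead.
How can I populate column 4 with the list of dbids?
Can anyone help?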
Answer 0 (score: 0)
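Don't use cat as an object name. It is the name of a base R function. Here is an (untested) implementation of the l/sapply( split(.,.) , FUN) paradigm, which handles problems where you want to run a calculation within segments of a data frame that share a particular feature (the activity, in this case):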
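# Assumes 'categ' is the renamed 'cat' and has an Activities column (not shown in the sample)
temp <- sapply( split( categ, categ$Activities) ,
function( db) c(Activities=db$Activities[1], # use the first row as a label
numberActivities= nrow(db), # get the count
DBIDs=paste0(db$dbid,collapse="|") ) # pipe-separated dbids for this activity
# Have concerns about the length of these results but it's what you asked for
)
temp <- as.data.frame( t(temp), stringsAsFactors=FALSE) # sapply gives one column per activity; transpose to one row each
final<- merge( temp, lookup, by.x="Activities", by.y="activity") # attach the verb for each activity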
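To show what the split/sapply pattern produces, here is a small sketch on made-up data. The `toy` data frame, its `activity` column, and the dbid values are all invented for illustration; the sample of 'molecules3.csv' shown in the question has no such column, so the real data would first need an activity label per row.

```r
# Made-up toy data: the 'activity' column is an assumption, not part of molecules3.csv
toy <- data.frame(
  dbid     = c(101, 102, 103, 104),
  activity = c("kinase", "aldolase", "kinase", "kinase"),
  stringsAsFactors = FALSE
)

# split() cuts the data frame into one piece per activity;
# sapply() then runs the same summary function on every piece
summ <- sapply(split(toy, toy$activity), function(db) {
  c(Activity         = db$activity[1],                     # label for this piece
    numberActivities = nrow(db),                           # how many rows it has
    DBIDs            = paste(db$dbid, collapse = "|"))     # pipe-separated dbids
})

# sapply() returns one column per activity, so transpose to get one row each
summ <- as.data.frame(t(summ), stringsAsFactors = FALSE)
print(summ)
```

Because `c()` mixes text and numbers here, every column of `summ` comes back as character; convert `numberActivities` with `as.integer()` if you need to do arithmetic on it.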
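A better answer could be given if you provided an actual sample that supports testing. What you have posted so far is not enough for that, so the `categ` code in this answer is untested.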
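If you would rather stay close to the grep approach from the question, one possibility is to reuse the row numbers that grep returns as an index into the dbid column. The sketch below runs on small made-up stand-ins for the two files (same column order as in the question: dbid first, function text second). The name `categ`, the column name `func`, and the use of `fixed = TRUE` (literal matching instead of regular expressions) are my own choices, not something taken from the question.

```r
# Made-up stand-ins for lookup.csv and molecules3.csv (column order as in the question)
lookup <- data.frame(
  activity = c("aldolase", "dehydratase", "kinase", "catalase"),
  verb     = c("cleaves", "dehydrates", "phosphorylates", "decomposes"),
  stringsAsFactors = FALSE
)
categ <- data.frame(
  dbid = c(6787632, 6787540, 6787321, 6786881),
  func = c("GDP-mannose 4,6-dehydratase activity",
           "fucokinase activity",
           "deoxyribose-phosphate aldolase activity",
           NA),                                   # empty activity read in as NA
  stringsAsFactors = FALSE
)

# Build one output row per activity, then stack the rows
out <- do.call(rbind, lapply(seq_len(nrow(lookup)), function(i) {
  # row numbers in categ whose function text contains the activity (NA never matches)
  hits <- grep(lookup$activity[i], categ[, 2], fixed = TRUE)
  data.frame(
    Activity         = lookup$activity[i],
    numberActivities = length(hits),
    rows             = paste(hits, collapse = "|"),
    DBIDs            = paste(categ[hits, 1], collapse = "|"),  # same rows, dbid column
    ReactionType     = lookup$verb[i],
    stringsAsFactors = FALSE
  )
}))
print(out)
```

In the loop from the question, the equivalent line would probably be something like out[i,4] <- paste(categ[grep(f[i], categ[,2]), 1], collapse="|"). Note that grep matches substrings, so "kinase" also hits "fucokinase", just as it does in the original loop.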