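I want to use the stringdot function from the kernlab package to classify DNA sequences with its SVM function, ksvm. The problem is that stringdot does not appear to support wildcards (i.e. "N") natively. Is it possible to write a custom version of stringdot, or another kernel, that supports wildcard matching (i.e. 'N' matches everything) and is just as efficient?
For example: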
library(kernlab)
library(Biostrings)
start_time <- Sys.time()
# Make training and validation data
dna <- DNAStringSet( c("CATG", "CATC", "CNNN", "GTAC", "GTAC", "GTAG") )
new <- DNAStringSet( c("CATG", "CATC", "CNNN", "GNNN", "GNNN", "GNNN") )
grp <- factor( c(1, 1, 1, 2, 2, 2) )
# Model and prediction
mod <- ksvm( as.list(dna), grp, type="C-svc", kernel="stringdot",
             kpar=list( length=4, type="spectrum" ),
             C=5, cross=0, prob.model=FALSE )
pre <- predict( mod, as.list(new) )
# Print stuff
end_time <- Sys.time()
print( end_time - start_time )
print( grp == pre )
This should give 100% prediction accuracy:
[1] TRUE TRUE TRUE TRUE TRUE TRUE
But because the Ns are treated as mismatches, it gives:
[1] TRUE TRUE TRUE FALSE FALSE FALSE
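To make the intended matching rule concrete, here is a minimal sketch of a wildcard-aware spectrum similarity in plain R. It is an illustration under assumptions, not kernlab code and not an efficient solution. The helper names wildcard_spectrum and wildcard_kernel_matrix and the n_weight parameter are hypothetical. The sketch assumes the alphabet is A, C, G, T plus N only, and that every sequence is at least k long. With n_weight = 0.25, N is treated as a uniform mix over the four bases, which keeps the similarity an inner product; n_weight = 1 gives literal "N matches everything" matching, which need not be positive semi-definite. The precomputed matrix is then handed to ksvm through as.kernelMatrix, with prediction on the columns selected by SVindex.

```r
# Hypothetical helper (not part of kernlab): wildcard-aware spectrum similarity.
# Every pair of k-mers is scored position by position: identical bases score 1,
# different bases score 0, and any position involving "N" scores n_weight.
wildcard_spectrum <- function(x, y, k = 4, n_weight = 0.25) {
  kmers <- function(s) {
    n <- nchar(s)
    if (n < k) return(matrix(character(0), nrow = 0, ncol = k))
    starts <- seq_len(n - k + 1)
    # One row per k-mer, one column per position
    do.call(rbind, strsplit(substring(s, starts, starts + k - 1), ""))
  }
  a <- kmers(x)
  b <- kmers(y)
  total <- 0
  for (i in seq_len(nrow(a))) {
    for (j in seq_len(nrow(b))) {
      score <- ifelse(a[i, ] == "N" | b[j, ] == "N",
                      n_weight, as.numeric(a[i, ] == b[j, ]))
      total <- total + prod(score)
    }
  }
  total
}

# Hypothetical helper: cosine-normalised similarity matrix between two
# character vectors (rows index xs, columns index ys).
wildcard_kernel_matrix <- function(xs, ys = xs, k = 4, n_weight = 0.25) {
  sim <- function(s, t) wildcard_spectrum(s, t, k = k, n_weight = n_weight)
  K  <- outer(seq_along(xs), seq_along(ys),
              Vectorize(function(i, j) sim(xs[i], ys[j])))
  dx <- vapply(xs, function(s) sim(s, s), numeric(1), USE.NAMES = FALSE)
  dy <- vapply(ys, function(s) sim(s, s), numeric(1), USE.NAMES = FALSE)
  K / sqrt(outer(dx, dy))
}

train <- c("CATG", "CATC", "CNNN", "GTAC", "GTAC", "GTAG")
test  <- c("CATG", "CATC", "CNNN", "GNNN", "GNNN", "GNNN")
grp   <- factor(c(1, 1, 1, 2, 2, 2))

K_train <- wildcard_kernel_matrix(train)        # train x train
K_test  <- wildcard_kernel_matrix(test, train)  # test x train

# Fit on the precomputed matrix; skipped if kernlab is not installed.
if (requireNamespace("kernlab", quietly = TRUE)) {
  library(kernlab)
  mod <- ksvm(as.kernelMatrix(K_train), grp, type = "C-svc", C = 5)
  # predict() expects one column per support vector
  sv  <- SVindex(mod)
  pre <- predict(mod, as.kernelMatrix(K_test[, sv, drop = FALSE]))
  print(grp == pre)
}
```

With a DNAStringSet, as.character(dna) yields the character vector these helpers expect. The double loop over k-mer pairs is far slower than stringdot's compiled code, so this only pins down the semantics; it does not address speed.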
Is there a way to solve this while keeping similar efficiency? For reference, the example above currently runs in:
Time difference of 0.1171031 secs
Thanks!