Question

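I have a very basic index with 2 fields: a numeric ID field and a long string field of ~60 to 100 characters. The string field contains DNA sequences, with no whitespace.

For example, a given field value is: AATCTAGATACGAGATCGATCGATCGATCGATCGATCGATGCTAGC

and the search string is something like: GATCGATCGA

There are over 7 million rows, and the index is about 1 GB.

I am storing the index in Azure blob storage and running a simple web app that queries the index on a B1 Web App instance.

No matter what I do, and regardless of the size of the search string, I cannot get the operation to run faster than 20-21 seconds.

I have tried scaling up to a B3 instance, but it still comes in at around 20 seconds.

I have isolated the bottleneck to the point where the query is run against the IndexSearcher.

To search, I add a wildcard to the beginning and the end of the search string.

My code is as follows: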

    // Analyzer and parser for the "nucleotide" field; leading wildcards must be enabled explicitly.
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
    var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "nucleotide", analyzer);
    parser.AllowLeadingWildcard = true;

    // Open the index from Azure blob storage, with a RAMDirectory as the local cache.
    var storageAccount = CloudStorageAccount.Parse("connection info");
    var azureDir = new AzureDirectory(storageAccount, "myindex", new RAMDirectory());
    IndexSearcher searcher = new IndexSearcher(azureDir, true); // true = read-only

    // Wrap the search string in wildcards and fetch the top 50 hits.
    var query = parser.Parse("*" + mystring + "*");
    TopDocs hits = searcher.Search(query, 50);

Answer 1

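This is not a Lucene answer at all. Once you have seen it, just leave a comment and I will delete it. But this would be a SQL solution.

Table:

    int32    seqID
    tinyint  pos
    char(1)  val

with the first two columns as a composite PK.

Then you build up the query: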

    select distinct t1.seqID
      from table t1
      join table t2
        on t2.seqID = t1.seqID
       and t2.pos   = t1.pos + 1
       and t1.val   = 'val1'
       and t2.val   = 'val2'
      join table t3
        on t3.seqID = t1.seqID
       and t3.pos   = t1.pos + 2
       and t3.val   = 'val3'
      join table t4
        on t4.seqID = t1.seqID
       and t4.pos   = t1.pos + 3
       and t4.val   = 'val4'
       ...

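I know this might look crazy, but SQL has an index for all of these joins and should filter early (not touch every row). Well, yes, it does touch all the rows, but it goes character by character and should give up on a sequence as soon as a character comparison fails.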
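Since the query has one join per character of the search string, you would probably generate it rather than write it by hand. Here is a rough sketch of that, not a tested implementation. The class name `SeqQueryBuilder` and the `tableName` parameter are hypothetical, the columns `seqID`, `pos` and `val` follow the table above, and restricting the alphabet to A, C, G and T is my assumption (it also keeps the inlined literals safe). The filter on the first character goes in a `where` clause, which for inner joins is equivalent to putting it in the first `on`.

```csharp
using System;
using System.Text;

public static class SeqQueryBuilder
{
    // Builds the self-join query for a search string such as "GATC".
    // tableName is a hypothetical placeholder for the real table name.
    public static string Build(string pattern, string tableName)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("pattern must not be empty", nameof(pattern));

        // Assumption: only plain nucleotide letters are allowed.
        foreach (char c in pattern)
        {
            if ("ACGT".IndexOf(c) < 0)
                throw new ArgumentException("pattern may only contain A, C, G or T", nameof(pattern));
        }

        var sql = new StringBuilder();
        sql.AppendLine("select distinct t1.seqID");
        sql.AppendLine($"  from {tableName} t1");

        // One join per remaining character, each offset from t1.pos.
        for (int i = 1; i < pattern.Length; i++)
        {
            int alias = i + 1;
            sql.AppendLine($"  join {tableName} t{alias}");
            sql.AppendLine($"    on t{alias}.seqID = t1.seqID");
            sql.AppendLine($"   and t{alias}.pos   = t1.pos + {i}");
            sql.AppendLine($"   and t{alias}.val   = '{pattern[i]}'");
        }

        sql.Append($" where t1.val   = '{pattern[0]}'");
        return sql.ToString();
    }
}
```

Calling `SeqQueryBuilder.Build("GATCGATCGA", "nucleotide")` would produce a query with nine joins, `t2` through `t10`.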

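Like I said in the comments, I would also try a brute-force regex, but I doubt it would beat 350 / ms, since it has to touch every row. And, as I also said in the comments, you cannot map to bytes there, because regex is a text search.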
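For reference, a brute-force regex scan could look roughly like the sketch below. It assumes the rows have already been loaded into memory as ID/sequence pairs; the class name `BruteForceScan` and that input shape are my own placeholders, not anything from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BruteForceScan
{
    // Returns the IDs of all sequences that contain pattern, scanning every row.
    public static List<int> Find(IEnumerable<KeyValuePair<int, string>> rows, string pattern)
    {
        // Escape the pattern so it is matched literally, and compile it once up front.
        var regex = new Regex(Regex.Escape(pattern), RegexOptions.Compiled);
        var matches = new List<int>();
        foreach (var row in rows)
        {
            if (regex.IsMatch(row.Value))
                matches.Add(row.Key);
        }
        return matches;
    }
}
```

This touches every row no matter what, which is exactly why I doubt it will be fast enough.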

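The other option is a DNA class that uses a byte array internally, with a Like method that does the comparison byte by byte, but I also doubt it would beat 350 / ms.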
byte array pattern match
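
A minimal sketch of what such a class might look like, assuming one byte per nucleotide (ASCII) and a naive scan. The name `DnaSequence` and the exact shape of `Like` are only illustrative, and this is not the code from the linked answer.

```csharp
using System;
using System.Text;

public sealed class DnaSequence
{
    // One ASCII byte per nucleotide (an assumption; a tighter packing is possible).
    private readonly byte[] _bases;

    public DnaSequence(string sequence)
    {
        _bases = Encoding.ASCII.GetBytes(sequence);
    }

    // Returns true if pattern occurs anywhere in the sequence.
    public bool Like(byte[] pattern)
    {
        if (pattern.Length == 0) return true;

        // Try every start position; stop comparing at the first mismatching byte.
        for (int start = 0; start <= _bases.Length - pattern.Length; start++)
        {
            int i = 0;
            while (i < pattern.Length && _bases[start + i] == pattern[i]) i++;
            if (i == pattern.Length) return true;
        }
        return false;
    }
}
```

With the example from the question, `new DnaSequence("AATCTAGATACGAGATCGATCGATCGATCGATCGATCGATGCTAGC").Like(Encoding.ASCII.GetBytes("GATCGATCGA"))` returns true. It still has to visit every row, so the same performance doubt applies.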

Unable to resolve a bottleneck in a Lucene.net search running on Azure
