Question

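I'm loading a bunch of data (14M nodes, 460M edges) into a neo4j database, and I'm using the BatchInserter for performance. I load the data in two passes: first the nodes, then the edges, using a BatchInserterIndex to look up node IDs when adding the edges.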

Each node has two properties:

name
type

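Names are not unique, but name+type is. This means I can't use get(String key, Obj value) to query the index, so I'm using query(Object query) instead, which is sparsely documented. I cribbed from the Ruby documentation, and it looks like the query object should be a Lucene query.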

However, when I query for name:"thename" type:"thetype", I get back a list of all the nodes in the database.

If all else fails, I can add a third property, "nametype", just to have a unique ID for the batch insertion, but I'd rather not if I don't have to. Any idea what's going on?

Snippets:

// the load-nodes phase:
BatchInserter inserter = BatchInserters.inserter(dbDir);
Map<String, Object> properties = new HashMap<String, Object>();
BatchInserterIndexProvider indexProvider = 
    new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex nodes = 
    indexProvider.nodeIndex( NODEINDEX, MapUtil.stringMap( "type", "exact" ) );

// for file in filelist
    // all nodes in a file have the same type
    properties.put( NODETYPE_KEY, types.get(file) );
    // for line in file:
        properties.put( NODENAME_KEY, line );
        long node = inserter.createNode( properties );
        nodes.add(node, properties);
    // \for
// \for

// ...

// the load-edges phase:
BatchInserter inserter = BatchInserters.inserter(dbDir);
BatchInserterIndexProvider indexProvider = 
    new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex nodes = 
    indexProvider.nodeIndex( NODEINDEX, MapUtil.stringMap( "type", "exact" ) );
nodes.setCacheCapacity( NODENAME_KEY, cache );

// for line in file
    String fromType = fromTypes.get(file);
    String fromName = parseFromName(line);
    String query = String.format("%s:\"%s\" %s:\"%s\"",
        NODETYPE_KEY,fromType,NODENAME_KEY,fromName);
    IndexHits<Long> froms = nodes.query(query);
    // froms has #nodes results ?!
// \for

Answer 1

Aaaaand the default conjunction in Lucene is "OR". :-/

I made the AND explicit in the query and it works.
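For reference, here is a minimal sketch of building the query string with the AND spelled out. The class and method names (NodeQueries, exactMatch, escapeQuoted) and the key values are hypothetical, not taken from the loader code in the question. Escaping backslashes and double quotes inside the quoted phrases is an extra precaution I'm assuming is wanted, based on the classic Lucene query parser syntax.

```java
final class NodeQueries {
    // Hypothetical key values; the question only shows the constant names.
    static final String NODETYPE_KEY = "type";
    static final String NODENAME_KEY = "name";

    // Escape backslashes first, then double quotes, so a value can sit
    // safely inside a quoted Lucene phrase.
    static String escapeQuoted(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build a query in which BOTH clauses must match (explicit AND),
    // instead of relying on the parser's default OR.
    static String exactMatch(String type, String name) {
        return String.format("%s:\"%s\" AND %s:\"%s\"",
            NODETYPE_KEY, escapeQuoted(type),
            NODENAME_KEY, escapeQuoted(name));
    }

    public static void main(String[] args) {
        System.out.println(exactMatch("thetype", "thename"));
    }
}
```

The resulting string would be passed to nodes.query(...) in place of the original format string. Prefixing each clause with + (marking it as required) should be an equivalent way to write the same query.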

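I also tried the alternative of a third key that concatenates type and name. In that case, index.get(key, val) looks to be about twice as fast as index.query(lucene_expression), while constructing and storing the extra property slows node loading by about 50%. Since my dataset has about 40 times as many relationships as nodes, adding the extra property to every node actually makes sense. YMMV.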
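A rough sketch of the concatenated-key idea, in case it helps. The class name, the separator, and the sample values are all hypothetical, and a plain HashMap stands in for the batch index purely to show that the combined key tells apart nodes that share a name. It assumes the separator never occurs inside a type string.

```java
import java.util.HashMap;
import java.util.Map;

final class NameTypeKey {
    // Hypothetical separator; assumed never to occur inside a type string,
    // so that type + separator + name stays unambiguous.
    static final String SEPARATOR = "|";

    // Combine type and name into a single value usable as a unique key.
    static String of(String type, String name) {
        return type + SEPARATOR + name;
    }

    public static void main(String[] args) {
        // Stand-in for the index: maps the combined key to a node id.
        Map<String, Long> index = new HashMap<>();
        index.put(of("person", "Paris"), 1L);
        index.put(of("city", "Paris"), 2L);

        // Same name, different type: each lookup finds its own node id.
        System.out.println(index.get(of("person", "Paris")));
        System.out.println(index.get(of("city", "Paris")));
    }
}
```

In the loader, the combined value would be stored under the third property (the "nametype" key from the question) during the node pass, then looked up with get("nametype", value) during the edge pass.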
