Question

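I have a legacy dataset (ENRON data represented as GraphML) that I would like to query. In a comment on a related question, @StefanArmbruster suggested that I use Cypher to query the database. My query use case is simple: given a message id (a property of the Message node), retrieve the node that has that id, and also retrieve the sender and recipient nodes of that message.

It seems that to do this in Cypher, I first have to create an index of the nodes. Is there a way to do this automatically when the data is loaded from the GraphML file? (I had used Gremlin to load the data and create the database.)

I also have an external Lucene index of the data (I need it for other purposes). Does it make sense to have two indexes? I could, for example, index the Neo4j node ids into my external index and then query the graph based on those ids. My concern is the persistence of these ids. (By analogy, Lucene document ids should not be treated as persistent.)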

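So, should I:

Index the Neo4j graph internally so that I can query on message ids using Cypher? (If so, what is the best way to do that: regenerate the database with some suitable incantation to build the index, or build the index on the already-existing database?)
Store Neo4j node ids in my external Lucene index and retrieve nodes via these stored ids?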

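Update

I have been trying to get auto-indexing to work with Gremlin and an embedded server, but with no luck. The documentation says:

The underlying database is auto-indexed, see Section 14.12, "Automatic Indexing", so the script can return the imported node by index lookup.

But when I examine the graph after loading a new database, no indexes seem to exist.

The Neo4j documentation on auto indexing says that a fair amount of configuration is required. In addition to setting node_auto_indexing = true, you also have to configure it:

To actually auto index something, you have to set which properties should get indexed. You do this by listing the property keys to index on. In the configuration file, use the node_keys_indexable and relationship_keys_indexable configuration keys. When using embedded mode, use the GraphDatabaseSettings.node_keys_indexable and GraphDatabaseSettings.relationship_keys_indexable configuration keys. In all cases, the value should be a comma-separated list of property keys to index on.

Is Gremlin supposed to set the GraphDatabaseSettings parameters? I tried passing a map into the Neo4jGraph constructor like this: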

    Map<String,String> config = [
        'node_auto_indexing':'true',
        'node_keys_indexable': 'emailID'
        ]
    Neo4jGraph g = new Neo4jGraph(graphDB, config);
    g.loadGraphML("../databases/data.graphml");

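But that had no apparent effect on index creation.

Update 2

Instead of configuring the database through Gremlin, I used the example given in the Neo4j documentation, so that my database creation looks like this (in Groovy):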

    protected Neo4jGraph getGraph(String graphDBName, String databaseName) {
        // Populate only when the store directory does not exist yet.
        boolean populateDB = !new File(graphDBName).exists();
        if (populateDB)
            println "creating database";
        else
            println "opening database";

        // Enable auto-indexing of the emailID node property before any data is loaded.
        GraphDatabaseService graphDB = new GraphDatabaseFactory().
            newEmbeddedDatabaseBuilder( graphDBName ).
            setConfig( GraphDatabaseSettings.node_keys_indexable, "emailID" ).
            setConfig( GraphDatabaseSettings.node_auto_indexing, "true" ).
            setConfig( GraphDatabaseSettings.dump_configuration, "true" ).
            newGraphDatabase();
        Neo4jGraph g = new Neo4jGraph(graphDB);

        if (populateDB) {
            println "Populating graph"
            g.loadGraphML(databaseName);
        }

        return g;
    }

My retrieval looks like this:

    ReadableIndex<Node> autoNodeIndex = graph.rawGraph.index()
        .getNodeAutoIndexer()
        .getAutoIndex();
    def node = autoNodeIndex.get( "emailID", "<2614099.1075839927264.JavaMail.evans@thyme>" ).getSingle();

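This seems to work. Note, however, that the getIndices() call on the Neo4jGraph object still returns an empty list. So the upshot is that I can exercise the Neo4j API correctly, but the Gremlin wrapper seems unable to reflect the indexing state. The expression g.idx('node_auto_index') (documented in Gremlin Methods) returns null.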

Answer 1

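Auto indexes are created lazily. That is, when you have enabled auto-indexing, the actual index is only created when the first property is indexed. Make sure you insert data before checking whether the index exists, otherwise it might not show up.

For some auto-indexing code (using programmatic configuration), see e.g. https://github.com/neo4j-contrib/rabbithole/blob/master/src/test/java/org/neo4j/community/console/IndexTest.java (this works with Neo4j 1.8).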
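Once data has been inserted and the auto index exists, the lookup described in the question could be sketched roughly as follows, using the Cypher START syntax of the Neo4j 1.8 era. This is only a sketch under assumptions: the store path is hypothetical, and because the relationship types in the ENRON GraphML are not given here, the query matches any relationship on the message node rather than naming sender or recipient types.

```groovy
import org.neo4j.cypher.javacompat.ExecutionEngine
import org.neo4j.graphdb.GraphDatabaseService
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import org.neo4j.graphdb.factory.GraphDatabaseSettings

// Hypothetical store path; the auto-indexing settings mirror the ones from the question.
GraphDatabaseService graphDB = new GraphDatabaseFactory().
    newEmbeddedDatabaseBuilder("../databases/enron.db").
    setConfig(GraphDatabaseSettings.node_keys_indexable, "emailID").
    setConfig(GraphDatabaseSettings.node_auto_indexing, "true").
    newGraphDatabase()

def engine = new ExecutionEngine(graphDB)

// Look the message up in the node auto index, then follow every relationship
// attached to it (assumption: senders and recipients are direct neighbours).
def query = '''
    START msg = node:node_auto_index(emailID = {id})
    MATCH msg-[r]-other
    RETURN msg, type(r) AS relType, other
'''

def result = engine.execute(query, [id: "<2614099.1075839927264.JavaMail.evans@thyme>"])
for (row in result) {
    println "${row.relType}: ${row.other}"
}

graphDB.shutdown()
```

If the relationship types turn out to be known, restricting the MATCH pattern to them would return only the sender and recipient nodes.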

/Peter

Answer 2

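Have you tried the auto-indexing feature? It is basically the use case you are looking for. Unfortunately, it needs to be enabled before the data is imported. (Otherwise you have to remove and re-add the properties to get them reindexed.)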

http://docs.neo4j.org/chunked/milestone/auto-indexing.html
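For a database that was imported before auto-indexing was enabled, the remove-and-re-add approach mentioned above might look something like the following sketch. It assumes a Neo4j 1.8-style embedded API, a hypothetical store path, and the emailID property key from the question; it has not been run against the ENRON data.

```groovy
import org.neo4j.graphdb.GraphDatabaseService
import org.neo4j.graphdb.Transaction
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import org.neo4j.graphdb.factory.GraphDatabaseSettings
import org.neo4j.tooling.GlobalGraphOperations

// Open the existing store with auto-indexing switched on (hypothetical path).
GraphDatabaseService graphDB = new GraphDatabaseFactory().
    newEmbeddedDatabaseBuilder("../databases/enron.db").
    setConfig(GraphDatabaseSettings.node_keys_indexable, "emailID").
    setConfig(GraphDatabaseSettings.node_auto_indexing, "true").
    newGraphDatabase()

// Re-write the property on every node that has it so the auto indexer sees it.
// A single transaction is used for brevity; a large graph may need batching.
Transaction tx = graphDB.beginTx()
try {
    for (node in GlobalGraphOperations.at(graphDB).getAllNodes()) {
        if (node.hasProperty("emailID")) {
            def value = node.removeProperty("emailID")
            node.setProperty("emailID", value)
        }
    }
    tx.success()
} finally {
    tx.finish()
}

graphDB.shutdown()
```

Regenerating the database with auto-indexing enabled from the start, as in the question's second update, avoids this extra pass.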

Neo4j indexing and legacy data
