Question

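I am trying to benchmark three different graph databases: Titan, OrientDB and Neo4j. I want to measure the execution time of database creation. As a test case I am using this dataset: http://snap.stanford.edu/data/web-flickr.html. Although the data is stored locally on disk rather than in the machine's memory, I noticed that loading it consumes a lot of memory, and unfortunately after a while Eclipse crashes. Why does this happen?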

Here are some code snippets.

Titan graph creation:

public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

OrientDB graph creation:

public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;    
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty( "nodeId", parts[0] );
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty( "nodeId", parts[1] );
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch(IOException ioe) {
        ioe.printStackTrace();
    }
    catch( Exception e ) {    
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

Neo4j graph creation:

public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime(); 
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    } 
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}

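Edit: I tried the BatchGraph solution and it seems to run forever. It ran all night yesterday and never finished, so I had to stop it. Is there something wrong with my code?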

TitanGraph graph = TitanFactory.open("data/titan");
    BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
        String line;
        int lineCounter = 1;
        while((line = reader.readLine()) != null) {
            if(lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = batchGraph.getVertex(parts[0]);
                if(srcVertex == null) {
                    srcVertex = batchGraph.addVertex(parts[0]);
                }
                Vertex dstVertex = batchGraph.getVertex(parts[1]);
                if(dstVertex == null) {
                    dstVertex = batchGraph.addVertex(parts[1]);
                }
                Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
                batchGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }

Answer 1

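Since you are trying to compare multiple databases, I'd suggest generalizing your code to Blueprints. The Flickr dataset looks like the right size for something like the BatchGraph graph wrapper. With BatchGraph you can tune the commit size and focus on the code that manages the loading. That way a single simple class can load all the different graphs (and you could easily extend your test to other Blueprints-enabled graphs).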
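To make the commit-size idea concrete, here is a minimal, self-contained sketch of batched loading. It does not use the Blueprints API: EdgeSink, CountingSink and BatchedEdgeLoader are hypothetical names made up for illustration, standing in for a transactional graph. It shows two things a wrapper like BatchGraph is meant to take care of: reusing a vertex whose id has been seen before, and committing once per batch rather than once per edge.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a transactional graph; NOT the Blueprints API.
interface EdgeSink {
    long addVertex(String externalId);              // returns an internal vertex handle
    void addEdge(long src, long dst, String label);
    void commit();
}

// Toy sink that only counts calls, so the sketch runs without a database.
class CountingSink implements EdgeSink {
    long vertices, edges, commits;
    public long addVertex(String externalId) { return vertices++; }
    public void addEdge(long src, long dst, String label) { edges++; }
    public void commit() { commits++; }
}

class BatchedEdgeLoader {
    // Loads "src dst" lines, skipping headerLines leading lines,
    // and commits after every batchSize edges. Returns the edge count.
    static long load(Iterable<String> lines, EdgeSink sink, int headerLines, int batchSize) {
        // Cache of external id -> vertex handle. It lives on the heap,
        // so its size grows with the number of distinct vertices.
        Map<String, Long> seen = new HashMap<>();
        long loaded = 0;
        int pending = 0;
        int lineNo = 0;
        for (String line : lines) {
            lineNo++;
            if (lineNo <= headerLines) continue;    // skip the dataset header
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue;         // skip blank or malformed lines
            long src = vertexFor(parts[0], seen, sink);
            long dst = vertexFor(parts[1], seen, sink);
            sink.addEdge(src, dst, "similar");
            loaded++;
            if (++pending >= batchSize) {           // one commit per full batch
                sink.commit();
                pending = 0;
            }
        }
        if (pending > 0) sink.commit();             // flush the final partial batch
        return loaded;
    }

    // Creates the vertex only the first time an id is seen.
    private static long vertexFor(String id, Map<String, Long> seen, EdgeSink sink) {
        Long v = seen.get(id);
        if (v == null) {
            v = sink.addVertex(id);
            seen.put(id, v);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# h1", "# h2", "# h3", "# h4", "1 2", "2 3", "1 3");
        CountingSink sink = new CountingSink();
        long n = load(lines, sink, 4, 2);
        System.out.println(n + " edges, " + sink.vertices + " vertices, " + sink.commits + " commits");
    }
}
```

With a real graph, an EdgeSink implementation would delegate to the database, and the batch size (1000 in the question's BatchGraph call) is the knob to tune.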

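@Stefan makes a good point about memory... you probably need to raise the -Xmx setting on the JVM to handle this data. Each graph handles memory differently (even though they persist to disk), and if you are loading all three at the same time in the same JVM, I'd bet there is some contention there.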
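One way to check whether a larger -Xmx actually reached the JVM running the loader is to print the heap limits from inside the program. This is a small sketch using only java.lang.Runtime; the class name HeapCheck is made up for illustration.

```java
// Hypothetical helper for inspecting the heap limits of the current JVM.
class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    // Upper bound the JVM will try to use for the heap (governed by -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently occupied: allocated heap minus the free part of it.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

After compiling, running it as `java -Xmx4g HeapCheck` (4g is an arbitrary example value, not a recommendation) and again without the flag shows whether the setting took effect. Calling usedHeapMiB() periodically during a load gives a rough view of how each database's memory use grows.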

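If you plan to go bigger than the Flickr dataset you referenced, then BatchGraph might not be the right choice. BatchGraph is generally good up to a few hundred million graph elements. When you start talking about graphs larger than that, you may want to forget some of what I said about trying to stay non-graph-specific. You will likely want to use the best tool for each graph you are testing. For Neo4j that means Neo4jBatchGraph (at least that way you are still using Blueprints, if that is important to you), for Titan that means Faunus or a custom-written parallel batch loader, and for OrientDB, its own batch-loading option.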

Answer 2

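With OrientDB, you can optimize this import in two ways:

using the OrientDB-specific extensions, and
avoiding transactions entirely.

So open the graph with OrientGraphNoTx instead of OrientGraph, and then try this snippet: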

OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0] );
OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1] );
Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");

without calling .commit().

Memory issues with graph databases
