Question

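I am trying to insert a relatively small graph (2M relationships, a few hundred thousand nodes) into Neo4j 2.0.3 from a CSV file. Each line in this file is one relationship. I am using the BatchInserter API.

To test my code, I use subsets of the input file. When the subset contains 500 relationships, the insert runs quickly (a few seconds, including JVM startup). When it contains 1000 relationships, the import takes 20 minutes and the resulting database is 130 GB! Even stranger, the result (in both time and space) is exactly the same with 5000 relationships. 99% of those 20 minutes is spent writing gigabytes to disk.

I don't understand what is happening here. I have tried configuring the inserter with various settings, following the recommendations from the official documentation.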

Files
  .asCharSource(new File("/path/to/input.csv"), Charsets.UTF_8)
  .readLines(new LineProcessor<Void>() {

    BatchInserter inserter = BatchInserters.inserter(
      "/path/to/db", 
      new HashMap<String, String>() {{
        put("dump_configuration","false");
        put("cache_type","none");
        put("use_memory_mapped_buffers","true");
        put("neostore.nodestore.db.mapped_memory","500M");
        put("neostore.relationshipstore.db.mapped_memory","1G");
        put("neostore.propertystore.db.mapped_memory","500M");
        put("neostore.propertystore.db.strings.mapped_memory","500M");
      }}
    );
    RelationshipType relationshipType = 
      DynamicRelationshipType.withName("relationshipType");
    Set<Long> createdNodes = new HashSet<>();

    @Override public boolean processLine(String line) throws IOException {
        String[] components = line.split("\\|");
        // Columns 1 and 3 of each pipe-separated line hold the node IDs.
        long sourceId = Long.parseLong(components[1]);
        long targetId = Long.parseLong(components[3]);

        if (!createdNodes.contains(sourceId)) {
           createdNodes.add(sourceId);
           inserter.createNode(sourceId, new HashMap<>());
        }
        if (!createdNodes.contains(targetId)) {
            createdNodes.add(targetId);
            inserter.createNode(targetId, new HashMap<>());
        }
        inserter.createRelationship(
            sourceId, targetId, relationshipType, new HashMap<>());

        return true;
    }

    @Override public Void getResult() {
        inserter.shutdown();
        return null;
    }

});

Answer 1

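I stumbled on the solution by accident while fiddling with my code.

It turns out that if I call createNode without specifying a node ID, everything works fine.

I was specifying the node IDs because the API allows it, and it was convenient to have the node IDs match the IDs in the input file.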
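
If the IDs from the input file still need to be traceable, one option is to let the inserter choose the node IDs and keep an in-memory map from input ID to generated ID. The sketch below is only an illustration of that idea: the class name NodeIdMapper and the property key "externalId" are hypothetical, and the node-creating function is passed in so the class itself has no Neo4j dependency.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongUnaryOperator;

/**
 * Maps IDs from the input file ("external" IDs) to the node IDs that the
 * inserter generates ("internal" IDs). Hypothetical helper, not part of Neo4j.
 */
public class NodeIdMapper {

    private final Map<Long, Long> externalToInternal = new HashMap<>();
    private final LongUnaryOperator nodeCreator;

    /**
     * @param nodeCreator creates a node for the given external ID and returns
     *                    the internal ID that was assigned to it
     */
    public NodeIdMapper(LongUnaryOperator nodeCreator) {
        this.nodeCreator = nodeCreator;
    }

    /** Returns the internal ID for an external ID, creating the node on first use. */
    public long internalIdFor(long externalId) {
        Long existing = externalToInternal.get(externalId);
        if (existing != null) {
            return existing;
        }
        long created = nodeCreator.applyAsLong(externalId);
        externalToInternal.put(externalId, created);
        return created;
    }

    // Possible wiring with the BatchInserter from the question (untested sketch,
    // needs java.util.Collections):
    //
    //   NodeIdMapper mapper = new NodeIdMapper(ext -> inserter.createNode(
    //       Collections.<String, Object>singletonMap("externalId", ext)));
    //   inserter.createRelationship(
    //       mapper.internalIdFor(sourceId), mapper.internalIdFor(targetId),
    //       relationshipType, new HashMap<String, Object>());
}
```

This replaces the createdNodes set from the question: the map both records which nodes exist and translates the IDs. Storing the input ID as a property is optional and only matters if it has to be looked up after the import.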

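My guess at the underlying cause: nodes are probably stored in a contiguous array indexed by their ID. Most IDs in my input file are small (4 digits), but some are as long as 12 digits. So when I try to insert one of those, Neo4j writes a gigabytes-long array to disk just to place that node at the end. Perhaps someone can confirm this. Surprisingly, this behavior does not appear to be documented in the Neo4j API documentation for this method.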
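
To put rough numbers on this guess, here is a back-of-the-envelope calculation. It assumes the store is a flat file of fixed-size records addressed by ID, so its size depends on the highest ID and not on the number of nodes. The record size of 14 bytes is an assumption made for illustration, and the class and method names are hypothetical.

```java
public class StoreSizeEstimate {

    /**
     * Size of a flat file that holds one fixed-size record per ID,
     * from ID 0 up to and including highestNodeId.
     */
    static long estimatedStoreBytes(long highestNodeId, int recordSizeBytes) {
        return Math.multiplyExact(Math.addExact(highestNodeId, 1L), (long) recordSizeBytes);
    }

    public static void main(String[] args) {
        int assumedRecordSize = 14; // assumption, not taken from the question

        // Highest ID has 4 digits: the file stays tiny.
        System.out.println(estimatedStoreBytes(9_999L, assumedRecordSize));         // 140000

        // Highest ID has 10 digits: the file grows to about 140 * 10^9 bytes.
        System.out.println(estimatedStoreBytes(9_999_999_999L, assumedRecordSize)); // 140000000000
    }
}
```

Under these assumptions a single large ID is enough to produce a store of more than a hundred gigabytes. It would also explain why the 1000-line and 5000-line subsets gave identical results, since the size would be set by the largest ID seen and not by the number of lines. This is a plausibility check of the guess, not a confirmation of how Neo4j lays out its files.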

Neo4j batch inserter is very slow and creates huge database files