我正在尝试向Neo4j插入大约200万个节点并且性能有问题。
我正在使用neo4j enterprise 2.2.0和用java编写的服务器扩展。我的电脑有一台ssd,32GB RAM,Intel Core i7 cpu,运行的是Windows 8.我运行一个独立版本的服务器,然后在bin文件夹中运行Neo4j.bat启动它。
现在插入10 000个没有关系的节点大约需要25秒(我需要稍后添加关系,但当时有一个问题)。
我认为这是一个配置问题,所以我玩了一些设置,但性能没有变化。我觉得奇怪的是,即使我在neo4j-wrapper.conf中将initmemory和maxmemory设置为15000,java进程也只分配3gb。
我在下面附上了我的代码和配置,有没有人知道我做错了什么?插入大图时我应该期待什么性能?
for (Thing t : things) {
List<ValuePair> properties = parseThing(t);
String uid = createUid(t);
try (Transaction tx = graphDb.beginTx()) {
Node node = graphDb.createNode();
node.setProperty("uid", uid);
for (ValuePair vp : properties) {
node.setProperty(vp.getName(), vp.getValue());
}
tx.success();
}
}
(首先我在创建节点时添加了DynamicLabel,但它甚至更慢。如果你想在插入节点时获得良好的性能,是否可以使用标签?)
neo4j.properties
################################################################
# Neo4j
#
# neo4j.properties - database tuning parameters
#
################################################################
# Enable this to be able to upgrade a store from an older version.
#allow_store_upgrade=true
# The amount of memory to use for mapping the store files, in bytes (or
# kilobytes with the 'k' suffix, megabytes with 'm' and gigabytes with 'g').
# If Neo4j is running on a dedicated server, then it is generally recommended
# to leave about 2-4 gigabytes for the operating system, give the JVM enough
# heap to hold all your transaction state and query context, and then leave the
# rest for the page cache.
# The default page cache memory assumes the machine is dedicated to running
# Neo4j, and is heuristically set to 75% of RAM minus the max Java heap size.
dbms.pagecache.memory=4g
# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0
# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true".
#keep_logical_logs=7 days
# Autoindexing
# Enable auto-indexing for nodes, default is false.
#node_auto_indexing=true
# The node property keys to be auto-indexed, if enabled.
#node_keys_indexable=name,age
# Enable auto-indexing for relationships, default is false.
#relationship_auto_indexing=true
# The relationship property keys to be auto-indexed, if enabled.
#relationship_keys_indexable=name,age
# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# The network interface IP the shell will listen on (use 0.0.0 for all interfaces).
#remote_shell_host=127.0.0.1
# The port the shell will listen on, default is 1337.
#remote_shell_port=1337
# The type of cache to use for nodes and relationships.
cache_type=hpc
cache.memory_ratio=70
# Maximum size of the heap memory to dedicate to the cached nodes.
node_cache_size=2g
#relationship_cache_size=6g
# Maximum size of the heap memory to dedicate to the cached relationships.
#relationship_cache_size=
# Enable online backups to be taken from this database.
online_backup_enabled=true
# Port to listen to for incoming backup requests.
online_backup_server=127.0.0.1:6362
# Uncomment and specify these lines for running Neo4j in High Availability mode.
# See the High availability setup tutorial for more details on these settings
# http://neo4j.com/docs/2.2.0/ha-setup-tutorial.html
# ha.server_id is the number of each instance in the HA cluster. It should be
# an integer (e.g. 1), and should be unique for each cluster instance.
#ha.server_id=
# ha.initial_hosts is a comma-separated list (without spaces) of the host:port
# where the ha.cluster_server of all instances will be listening. Typically
# this will be the same for all cluster instances.
#ha.initial_hosts=192.168.0.1:5001,192.168.0.2:5001,192.168.0.3:5001
# IP and port for this instance to listen on, for communicating cluster status
# information iwth other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.cluster_server=192.168.0.1:5001
# IP and port for this instance to listen on, for communicating transaction
# data with other instances (also see ha.initial_hosts). The IP
# must be the configured IP address for one of the local interfaces.
#ha.server=192.168.0.1:6001
# The interval at which slaves will pull updates from the master. Comment out
# the option to disable periodic pulling of updates. Unit is seconds.
ha.pull_interval=10
# Amount of slaves the master will try to push a transaction to upon commit
# (default is 1). The master will optimistically continue and not fail the
# transaction even if it fails to reach the push factor. Setting this to 0 will
# increase write performance when writing through master but could potentially
# lead to branched data (or loss of transaction) if the master goes down.
#ha.tx_push_factor=1
# Strategy the master will use when pushing data to slaves (if the push factor
# is greater than 0). There are two options available "fixed" (default) or
# "round_robin". Fixed will start by pushing to slaves ordered by server id
# (highest first) improving performance since the slaves only have to cache up
# one transaction at a time.
#ha.tx_push_strategy=fixed
# Policy for how to handle branched data.
#branched_data_policy=keep_all
# Clustering timeouts
# Default timeout.
#ha.default_timeout=5s
# How often heartbeat messages should be sent. Defaults to ha.default_timeout.
#ha.heartbeat_interval=5s
# Timeout for heartbeats between cluster members. Should be at least twice that of ha.heartbeat_interval.
#heartbeat_timeout=11s
neo4j-server.properties
################################################################
# Neo4j
#
# neo4j-server.properties - runtime operational settings
#
################################################################
#***************************************************************
# Server configuration
#***************************************************************
# location of the database directory
org.neo4j.server.database.location=data/graph.db
# Low-level graph engine tuning file
org.neo4j.server.db.tuning.properties=conf/neo4j.properties
# Database mode
# Allowed values:
# HA - High Availability
# SINGLE - Single mode, default.
# To run in High Availability mode, configure the neo4j.properties config file, then uncomment this line:
#org.neo4j.server.database.mode=HA
# Let the webserver only listen on the specified IP. Default is localhost (only
# accept local connections). Uncomment to allow any connection. Please see the
# security section in the neo4j manual before modifying this.
#org.neo4j.server.webserver.address=0.0.0.0
# Require (or disable the requirement of) auth to access Neo4j
dbms.security.auth_enabled=true
#
# HTTP Connector
#
# http port (for all data, administrative, and UI access)
org.neo4j.server.webserver.port=7474
#
# HTTPS Connector
#
# Turn https-support on/off
org.neo4j.server.webserver.https.enabled=true
# https port (for all data, administrative, and UI access)
org.neo4j.server.webserver.https.port=7473
# Certificate location (auto generated if the file does not exist)
org.neo4j.server.webserver.https.cert.location=conf/ssl/snakeoil.cert
# Private key location (auto generated if the file does not exist)
org.neo4j.server.webserver.https.key.location=conf/ssl/snakeoil.key
# Internally generated keystore (don't try to put your own
# keystore there, it will get deleted when the server starts)
org.neo4j.server.webserver.https.keystore.location=data/keystore
# Comma separated list of JAX-RS packages containing JAX-RS resources, one
# package name for each mountpoint. The listed package names will be loaded
# under the mountpoints specified. Uncomment this line to mount the
# org.neo4j.examples.server.unmanaged.HelloWorldResource.java from
# neo4j-server-examples under /examples/unmanaged, resulting in a final URL of
# http://localhost:7474/examples/unmanaged/helloworld/{nodeId}
#org.neo4j.server.thirdparty_jaxrs_classes=org.neo4j.examples.server.unmanaged=/examples/unmanaged
org.neo4j.server.thirdparty_jaxrs_classes=my.project.package=/mypath
#*****************************************************************
# HTTP logging configuration
#*****************************************************************
# HTTP logging is disabled. HTTP logging can be enabled by setting this
# property to 'true'.
org.neo4j.server.http.log.enabled=false
# Logging policy file that governs how HTTP log output is presented and
# archived. Note: changing the rollover and retention policy is sensible, but
# changing the output format is less so, since it is configured to use the
# ubiquitous common log format
org.neo4j.server.http.log.config=conf/neo4j-http-logging.xml
#*****************************************************************
# Administration client configuration
#*****************************************************************
# location of the servers round-robin database directory. possible values:
# - absolute path like /var/rrd
# - path relative to the server working directory like data/rrd
# - commented out, will default to the database data directory.
org.neo4j.server.webadmin.rrdb.location=data/rrd
的Neo4j-wrapper.conf
#********************************************************************
# Property file references
#********************************************************************
wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties
#********************************************************************
# JVM Parameters
#********************************************************************
wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.2=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional.3=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional.4=-XX:hashCode=5
# Remote JMX monitoring, uncomment and adjust the following lines as needed.
# Also make sure to update the jmx.access and jmx.password files with appropriate permission roles and passwords,
# the shipped configuration contains only a read only role called 'monitor' with password 'Neo4j'.
# For more details, see: http://download.oracle.com/javase/7/docs/technotes/guides/management/agent.html
# On Unix based systems the jmx.password file needs to be owned by the user that will run the server,
# and have permissions set to 0600.
# For details on setting these file permissions on Windows see:
# http://docs.oracle.com/javase/7/docs/technotes/guides/management/security-windows.html
#wrapper.java.additional=-Dcom.sun.management.jmxremote.port=3637
#wrapper.java.additional=-Dcom.sun.management.jmxremote.authenticate=true
#wrapper.java.additional=-Dcom.sun.management.jmxremote.ssl=false
#wrapper.java.additional=-Dcom.sun.management.jmxremote.password.file=conf/jmx.password
#wrapper.java.additional=-Dcom.sun.management.jmxremote.access.file=conf/jmx.access
# Some systems cannot discover host name automatically, and need this line configured:
#wrapper.java.additional=-Djava.rmi.server.hostname=$THE_NEO4J_SERVER_HOSTNAME
# Uncomment the following lines to enable garbage collection logging
#wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
#wrapper.java.additional=-XX:+PrintGCDetails
#wrapper.java.additional=-XX:+PrintGCDateStamps
#wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
#wrapper.java.additional=-XX:+PrintPromotionFailure
#wrapper.java.additional=-XX:+PrintTenuringDistribution
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=15000
wrapper.java.maxmemory=15000
#********************************************************************
# Wrapper settings
#********************************************************************
# path is relative to the bin dir
wrapper.pidfile=../data/neo4j-server.pid
#********************************************************************
# Wrapper Windows NT/2000/XP Service Properties
#********************************************************************
# WARNING - Do not modify any of these properties when an application
# using this configuration file has been installed as a service.
# Please uninstall the service before modifying this section. The
# service can then be reinstalled.
# Name of the service
wrapper.name=neo4j
# User account to be used for linux installs. Will default to current
# user if not set.
wrapper.user=
#********************************************************************
# Other Neo4j system properties
#********************************************************************
wrapper.java.additional=-Dneo4j.ext.udc.source=zip
wrapper.java.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -Xdebug-Xnoagent-Djava.compiler=NONE-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005
如果你帮我解决这个问题,你会很开心!
答案 0 :(得分:2)
您需要在事务中创建多个节点,否则事务开销会占用大部分时间。
请这样试试:
try (Transaction tx = graphDb.beginTx()) {
for (Thing t : things) {
List<ValuePair> properties = parseThing(t);
String uid = createUid(t);
Node node = graphDb.createNode();
node.setProperty("uid", uid);
for (ValuePair vp : properties) {
node.setProperty(vp.getName(), vp.getValue());
}
}
tx.success();
}
答案 1 :(得分:2)
非常感谢Christian Morgner和Michael Hunger指出我正确的方向!
解决方案是拆分列表并进行较小的事务并使用线程。首先,我添加所有节点,然后添加所有关系。您可以使用批量大小,我想最好的取决于您的图表的样子。
这是我的代码(简化):
主要强>
public static final int CPU = Runtime.getRuntime().availableProcessors()*2;
public static final int BATCH_NODES = 100_000;
public static final int BATCH_RELATIONS = 50_000;
ExecutorService pool = createPool(CPU, CPU * 25);
for(int i = 0; i < things.size(); i = i + BATCH_NODES) {
CreateNodeAndRelationRunner nodeRunner;
if(i + BATCH_NODES < things.size()) {
nodeRunner = new CreateNodeRunner(graphDb, things.subList(i, i + BATCH_NODES));
} else {
nodeRunner = new CreateNodeRunner(graphDb, things.subList(i, things.size()));
}
pool.submit(nodeRunner);
}
pool.shutdown();
boolean nodesCreated = false;
try {
nodesCreated = pool.awaitTermination(1, TimeUnit.DAYS);
} catch (InterruptedException e) {
logger.debug("CreateNodeThread was interrupted");
logger.debug(e.getMessage());
}
if(nodesCreated) {
pool = createPool(CPU, CPU * 25);
for(int i = 0; i < things.size(); i=i+ BATCH_RELATIONS) {
CreateRelationsRunner relationsRunner;
if(i+ BATCH_RELATIONS < things.size()) {
relationsRunner = new CreateRelationsRunner(graphDb, things.subList(i, i+ BATCH_RELATIONS));
} else {
relationsRunner = new CreateRelationsRunner(graphDb, things.subList(i, things.size()));
}
pool.submit(relationsRunner);
}
pool.shutdown();
}
<强> CreateNodeRunner.java 强>
public class CreateNodeRunner implements Runnable {
private List<Thing> things;
private GraphDatabaseService graphDb;
public CreateNodeRunner(GraphDatabaseService graphDb, List<Thing> things) {
this.things = things;
this.graphDb = graphDb;
}
@Override
public void run() {
try (Transaction tx = graphDb.beginTx()) {
for(Thing t : things) {
Node node = graphDb.createNode(t.getLabel());
node.setProperty("uid", t.getUid());
for (ValuePair vp : t.getProperties()) {
node.setProperty(vp.getName(), vp.getValue());
}
}
tx.success();
}
}
}
<强> CreateRelationsRunner.java 强>
public class CreateRelationsRunner implements Runnable {
private GraphDatabaseService graphDb;
private List<Thing> things;
public CreateRelationsRunner(GraphDatabaseService graphDb, List<Thing> things) {
this.graphDb = graphDb;
this.things = things;
}
@Override
public void run() {
try (Transaction tx = graphDb.beginTx()) {
for(Thing tFrom : things) {
List<ValuePair> relations = tFrom.getRelations();
Label label = tFrom.getLabel();
Node firstNode = graphDb.findNode(label, "uid", tFrom.getUid());
for(ValuePair vp : relations) {
Thing tTo = (Thing) vp.getValue();
label = tTo.getLabel();
Node secondNode = graphDb.findNode(label, "uid", tTo.getUid());
RelationshipType relType = vp.getRelationshipType();
firstNode.createRelationshipTo(secondNode, relType);
}
}
tx.success();
}
}
}
如果您发现错误或看到可能的改进,请告诉我们。 :)