Following up on my question here, I would like to create a constraint on relationships. That is, I would like there to be multiple nodes that share the same neighborhood name, but each of which points uniquely to the particular city in which it exists.
As encouraged by user2194039's answer, I am using the following index:
CREATE INDEX ON :Neighborhood(name)
Additionally, I have the following constraint:
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
The following code fails to create unique relationships, and it takes an excessively long time:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});
Note that there is a uniqueness constraint on City, but not on Neighborhood (since there should be multiple of those).
Profile for a LIMIT of 10,000:
+--------------+-------+--------+----------------------------------+------------------------+
| Operator     | Rows  | DbHits | Identifiers                      | Other                  |
+--------------+-------+--------+----------------------------------+------------------------+
| EmptyResult  |     0 |      0 |                                  |                        |
| UpdateGraph  |  9750 |   3360 | anon[307], b, neighborhood, line | MergePattern           |
| SchemaIndex  |  9750 |  19500 | b, line                          | line.City; :City(name) |
| ColumnFilter |  9750 |      0 | line                             | keep columns line      |
| Filter       |  9750 |      0 | anon[220], line                  | anon[220]              |
| Extract      | 10000 |      0 | anon[220], line                  | anon[220]              |
| Slice        | 10000 |      0 | line                             | { AUTOINT0}            |
| LoadCSV      | 10000 |      0 | line                             |                        |
+--------------+-------+--------+----------------------------------+------------------------+
Total database accesses: 22860
Following Guilherme's suggestion below, I implemented the helper, but it raises the error py2neo.error.Finished. I searched the documentation, but could not determine a workaround for this. There appears to be an open SO post about this exception.
import time  # used by the main loop below

# Imports assumed for py2neo 2.x; http exposes the socket_timeout setting.
from py2neo import Graph, authenticate, error
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
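One plausible culprit, assuming py2neo 2.x: tx.commit() itself sends any pending statements and returns their results, so the separate tx.process() call may be redundant and could be what leaves the transaction in a finished state. A minimal sketch of the helper with the two calls collapsed into one, offered as a hypothesis to test rather than a confirmed fix:

# Hypothetical variant: rely on commit() alone to send the pending
# statements, and copy the records out before returning. Assumes py2neo 2.x,
# where commit() returns the results of any statements not yet processed.
def run_batch_query_commit_only(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    graph = Graph()
    authenticate("localhost:7474", "account", "password")
    tx = graph.cypher.begin()
    for statement, params in queries:
        tx.append(statement, params)
    results = tx.commit()  # send and commit in one step; no tx.process()
    return [[record for record in result] for result in results]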
Main:
queries = []
template = ["MERGE (city:City {Name: {city}})",
            "MERGE (city)<-[:IN]-(n:Neighborhood {Name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()
# city_neighborhood_map is a defaultdict that maps city -> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c += 1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time() * 1000
            r = run_batch_query(queries, 10)
            e = time.time() * 1000
            print("\t{0}, {1:.0f}ms".format(c, e - s))
            del queries[:]
print c
if queries:
    s = time.time() * 1000
    r = run_batch_query(queries, 300)
    e = time.time() * 1000
    print("\t{0} {1:.0f}ms".format(c, e - s))
end = time.time()
print("End. {0}s".format(end - start))
Answer 0 (score: 1):
If you want to create unique relationships, you have two choices:
Use MERGE to prevent the path from being duplicated, as @user2194039 suggested. I think this is the simplest and best approach you can take.
Turn your relationship into a node, and create a unique constraint on it. For most cases, though, this is simply unnecessary (a hypothetical sketch follows).
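A hypothetical sketch of that second option, in the same statement-template style used elsewhere in this post (the Membership label, key property, and OF/FOR relationship types are all invented for illustration):

# Reify the IN relationship as a Membership node so a uniqueness constraint
# can apply to it. Requires, once per database:
#   CREATE CONSTRAINT ON (m:Membership) ASSERT m.key IS UNIQUE
reify_template = [
    "MERGE (c:City {name: {city}})",
    "MERGE (n:Neighborhood {name: {neighborhood}})",
    # The key combines both names, so each (city, neighborhood) pair is unique.
    "MERGE (m:Membership {key: {city} + '|' + {neighborhood}})",
    "MERGE (c)<-[:OF]-(m)-[:FOR]->(n)",
]
reify_statement = '\n'.join(reify_template)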
If you are having issues with speed, try using the transactional endpoint. I tried importing your data (random cities and neighborhoods) via LOAD CSV on 2.2.1, and it was slow for me too, although I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process, and you will probably get a performance boost. I managed to import 1M rows in under 11 minutes.
I used an INDEX for Neighborhood(name) and a unique constraint for City(name). Give it a try and see if it works for you.
Edit:
The transactional endpoint is a RESTful endpoint that allows you to execute statements in batches, within a transaction. You can read about it here. Basically, it lets you stream a bunch of queries to the server at once.
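For illustration, a minimal sketch of calling that endpoint directly, assuming the requests package, a default local install at localhost:7474, and the credentials used earlier in this post:

import requests

# One POST to the transactional commit endpoint can carry many parameterized
# statements, all executed and committed in a single transaction.
payload = {
    "statements": [
        {
            "statement": "MERGE (c:City {name: {city}})",
            "parameters": {"city": "Boston"},
        },
        # ...one entry per query in the batch...
    ]
}
resp = requests.post(
    "http://localhost:7474/db/data/transaction/commit",
    json=payload,
    auth=("account", "password"),  # assumed credentials
)
resp.raise_for_status()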
I don't know what programming language/stack you are using, but in Python, using a package like py2neo, it would be something like this:
import csv
import time
# neo4j below is assumed to be the module holding the helper function
# shown after this block.

with open("city.csv", "r") as fp:
    reader = csv.reader(fp)
    queries = []
    template = ["MERGE (c :`City` {name: {city}})",
                "MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
    statement = '\n'.join(template)
    batch = 5000
    c = 1
    start = time.time()
    for row in reader:
        city, neighborhood = row
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        if c % batch == 0:
            s = time.time() * 1000
            r = neo4j.run_batch_query(queries, 10)
            e = time.time() * 1000
            print("\t{0}, {1:.0f}ms".format(c, e - s))
            del queries[:]
        c += 1
    if queries:
        s = time.time() * 1000
        r = neo4j.run_batch_query(queries, 300)
        e = time.time() * 1000
        print("\t{0} {1:.0f}ms".format(c, e - s))
    end = time.time()
    print("End. {0}s".format(end - start))
Helper function:
# Imports assumed for py2neo 2.x; http exposes the socket_timeout setting.
from py2neo import Graph
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph(uri)  # uri = "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
        results = tx.process()
        tx.commit()
    except http.SocketError as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection
You will monitor how long each transaction takes, and you can tune the number of queries per transaction, as well as the timeout (a rough tuning sketch follows).
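As a rough illustration of that tuning idea (the thresholds and bounds below are arbitrary assumptions, not recommendations):

# Hypothetical helper: halve the batch when a transaction runs long, and
# grow it when transactions finish quickly, within fixed bounds.
def tune_batch(batch, elapsed_ms, fast_ms=500, slow_ms=5000):
    if elapsed_ms > slow_ms:
        return max(batch // 2, 500)
    if elapsed_ms < fast_ms:
        return min(batch * 2, 20000)
    return batch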
Answer 1 (score: 0):
To make sure we are on the same page, this is how I understand your model: each City is unique and should have some Neighborhoods pointing at it. Neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you can also have a neighborhood 3 [IN] city Seattle, and those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
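If it helps to verify that understanding after an import, here is a hypothetical check (Cypher, written as a Python statement string in the style above) that lists any neighborhood merged more than once within the same city; it should return zero rows under this model:

check_statement = '\n'.join([
    "MATCH (c:City)<-[:IN]-(n:Neighborhood)",
    "WITH c.name AS city, n.name AS neighborhood, count(*) AS copies",
    "WHERE copies > 1",
    "RETURN city, neighborhood, copies",
])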
Before importing, I suggest adding an index to the Neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this drastically improves speed on even small databases.
CREATE INDEX ON :Neighborhood(name)
To import:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})
If you are importing a large amount of data, it is best to commit periodically during the import with the USING PERIODIC COMMIT command. This reduces the memory used by the process, and if your server is memory-constrained, I can see it helping performance. In your case, with almost a million records, this is what Neo4j recommends. You can even adjust how often the commits happen, e.g. USING PERIODIC COMMIT 10000. The docs say 1000 is the default. Just understand that this causes the import to run in multiple transactions.
Good luck!