How do I efficiently create unique relationships in Neo4j?

Asked: 2015-05-25 19:28:26

Tags: optimization neo4j cypher

Following up on my question here, I would like to create a constraint on relationships. That is, I would like multiple nodes to share the same "neighborhood" name, but have each one uniquely point to the particular city it sits in.

As encouraged by user2194039's answer, I am using the following index:

CREATE INDEX ON :Neighborhood(name)

Additionally, I have the following constraint:

CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;

The following code fails to create unique relationships, and takes an excessively long time:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});

Note that the uniqueness constraint is on City, but not on Neighborhood (because there should be many of them).
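As a sanity check, here is one way to verify whether duplicates were actually created (hypothetical verification queries, not part of the original import):

// Parallel :IN relationships between the same pair of nodes
MATCH (c:City)<-[r:IN]-(n:Neighborhood)
WITH c, n, count(r) AS rels
WHERE rels > 1
RETURN c.name, n.name, rels;

// Same-named Neighborhood nodes attached to the same City
MATCH (c:City)<-[:IN]-(n:Neighborhood)
WITH c, n.name AS name, count(DISTINCT n) AS nodes
WHERE nodes > 1
RETURN c.name, name, nodes;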

The profile for a run limited to 10,000 rows:

+--------------+-------+--------+----------------------------------+------------------------+
|     Operator |  Rows | DbHits |                      Identifiers |                  Other |
+--------------+-------+--------+----------------------------------+------------------------+
|  EmptyResult |     0 |      0 |                                  |                        |
|  UpdateGraph |  9750 |   3360 | anon[307], b, neighborhood, line |           MergePattern |
|  SchemaIndex |  9750 |  19500 |                          b, line | line.City; :City(name) |
| ColumnFilter |  9750 |      0 |                             line |      keep columns line |
|       Filter |  9750 |      0 |                  anon[220], line |              anon[220] |
|      Extract | 10000 |      0 |                  anon[220], line |              anon[220] |
|        Slice | 10000 |      0 |                             line |           {  AUTOINT0} |
|      LoadCSV | 10000 |      0 |                             line |                        |
+--------------+-------+--------+----------------------------------+------------------------+

Total database accesses: 22860

Following Guilherme's suggestion below, I implemented the helper, but it raises the error py2neo.error.Finished. I searched the documentation, but was unable to determine a workaround for this. It looks like there is an open SO post about this exception.

from py2neo import Graph, authenticate, error
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
            # NOTE: process() and commit() run on every loop iteration here;
            # once commit() finishes the transaction, the next append() raises
            # py2neo.error.Finished.
            results = tx.process()
            tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection

Main:

queries = []
template = ["MERGE (city:City {name: {city}})", "MERGE (city)<-[:IN]-(n:Neighborhood {name: {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()

# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c +=1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

print c
if queries:
    s = time.time()*1000 
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))

2 Answers:

Answer 0: (score: 1)

If you want to create unique relationships, you have a couple of options:

  1. Prevent the path from being duplicated with MERGE, just like @user2194039 suggested. I think this is the simplest and best approach you can take.

  2. Turn your relationship into a node, and create a unique constraint on it. For most cases, though, this is hardly necessary (see the sketch just after this list).

  3. If you are having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighborhoods) by CSV import on 2.2.1, and it was slow for me as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process and possibly get a performance boost. I managed to import 1M rows in under 11 minutes.

    I used an index on Neighborhood(name) and the unique constraint on City(name). Give it a try and see if it works for you.
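    To illustrate option 2, here is a minimal sketch of reifying the IN relationship as a node that carries a unique constraint. The Membership label and the composite key property are hypothetical, not from the original post:

    // Schema (run separately from the data statements)
    CREATE CONSTRAINT ON (m:Membership) ASSERT m.key IS UNIQUE;

    // Data: the key encodes the (city, neighborhood) pair, so the
    // city-neighborhood link can only exist once
    MATCH (c:City {name: "Boston"})
    MERGE (m:Membership {key: "Boston|3"})
    MERGE (c)<-[:IN_CITY]-(m)
    MERGE (m)<-[:HAS]-(n:Neighborhood {name: 3});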

    EDIT:

    The transactional endpoint is a RESTful endpoint that allows you to execute statements in batches, within a transaction. You can read about it here. Basically, it allows you to stream a bunch of queries to the server at once.
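    For reference, the raw request that endpoint accepts looks roughly like this, independent of any driver (a sketch using the requests library; the URL, credentials, and sample data are placeholders, not from the original post):

    import requests

    # One transaction: a batch of parameterized statements, committed at once
    payload = {
        "statements": [{
            "statement": "MERGE (c:City {name: {city}}) "
                         "MERGE (c)<-[:IN]-(:Neighborhood {name: {neighborhood}})",
            "parameters": {"city": "Boston", "neighborhood": 3},
        }]
    }

    resp = requests.post(
        "http://localhost:7474/db/data/transaction/commit",  # Neo4j 2.x REST API
        json=payload,
        auth=("account", "password"),
    )
    resp.raise_for_status()
    print(resp.json())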

    I don't know what programming language/stack you are using, but in Python, with a package like py2neo, it would be something like this:

    import csv
    import time

    with open("city.csv", "r") as fp:

        reader = csv.reader(fp)

        queries = []
        template = ["MERGE (c :`City` {name: {city}})",
                    "MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
        statement = '\n'.join(template)

        batch = 5000
        c = 1
        start = time.time()

        for row in reader:
            city, neighborhood = row
            params = dict(city=city, neighborhood=neighborhood)
            queries.append((statement, params))

            if c % batch == 0:
                s = time.time()*1000
                # neo4j is the module holding the run_batch_query helper below
                r = neo4j.run_batch_query(queries, 10)
                e = time.time()*1000
                print("\t{0}, {1:.00f}ms".format(c, e-s))
                del queries[:]

            c += 1

        if queries:
            s = time.time()*1000
            r = neo4j.run_batch_query(queries, 300)
            e = time.time()*1000
            print("\t{0} {1:.00f}ms".format(c, e-s))

        end = time.time()
        print("End. {0}s".format(end-start))
    

    The helper function:

    from py2neo import Graph
    from py2neo.packages.httpstream import http

    def run_batch_query(queries, timeout=None):

        if timeout:
            http.socket_timeout = timeout

        try:
            graph = Graph(uri) # "{protocol}://{host}:{port}/db/data/"
            tx = graph.cypher.begin()

            for query in queries:
                statement, params = query
                tx.append(statement, params)

            # process() and commit() run once, after all statements have been
            # appended, so the transaction is not finished mid-loop
            results = tx.process()
            tx.commit()

        except http.SocketError as err:
            raise err

        collection = []
        for result in results:
            records = []
            for record in result:
                records.append(record)
            collection.append(records)

        return collection
    

    You can monitor how long each transaction takes, and tune the number of queries per transaction as well as the timeout.

Answer 1: (score: 0)

To make sure we are on the same page, this is how I understand your model: each City is unique and should have some Neighborhoods pointing to it. Neighborhoods are unique within the context of a City, but not globally. So if you have a Neighborhood 3 [IN] City Boston, you could also have a Neighborhood 3 [IN] City Seattle, and those neighborhoods would be represented by different nodes, even though they have the same name property. Is that correct?
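If that is right, the intended shape is easy to pin down with a tiny hypothetical example (illustrative data only):

CREATE (:City {name: "Boston"})<-[:IN]-(:Neighborhood {name: 3})
CREATE (:City {name: "Seattle"})<-[:IN]-(:Neighborhood {name: 3})

Two distinct Neighborhood nodes share the name 3, but each points at its own unique City.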

Before importing, I recommend adding an index to your Neighborhood nodes; you can add an index without enforcing uniqueness. I have found that this dramatically improves speed on even small databases.

CREATE INDEX ON :Neighborhood(name)

The import:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})

If you are importing a large amount of data, it is best to use USING PERIODIC COMMIT so that the load commits periodically as it runs. This reduces the memory the process uses, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is what Neo4j recommends. You can even adjust how often the commits happen, e.g. USING PERIODIC COMMIT 10000; the docs say 1000 is the default. Just understand that this causes the import to run in multiple transactions.

Best of luck!