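I have imported data from a CSV file and created a large number of nodes, all of which are related to other nodes in the same data set through a "tree number" hierarchy:
For example, the node with tree number A01.111 is a direct child of node A01, and the node with tree number A01.111.230 is a direct child of node A01.111.
What I am trying to do is create unique relationships between nodes that are direct children of other nodes. For example, node A01.111.230 should have only one IS_CHILD_OF relationship, and it should point to node A01.111.
I have tried several things, for example: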
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (n <> n2)
AND NOT ((n2)-[:IS_CHILD_OF]->())
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
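This example does create unique IS_CHILD_OF relationships, but not with each node's direct parent. Instead, node A01.111.230 ends up related to node A01.
Answer 0 (score: 3)
I'd like to propose another generic solution, one that also avoids the Cartesian product @InverseFalcon pointed out.
Let's first create an index for faster lookups, and insert some test data: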
CREATE CONSTRAINT ON (n:Node) ASSERT n.treeNumber IS UNIQUE;
CREATE (:Node {treeNumber: 'A01.111.230'})
CREATE (:Node {treeNumber: 'A01.111'})
CREATE (:Node {treeNumber: 'A01'})
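Then we need to scan all the nodes as potential parents and look for children whose treeNumber starts with the parent's treeNumber (STARTS WITH can use the index) and has no dot in the "remainder" after that prefix (i.e. a direct child), instead of splitting, joining, etc.: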
MATCH (p:Node), (c:Node)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
I replaced the creation of the relationship with a simple RETURN for profiling purposes, but you can simply swap it for CREATE UNIQUE or MERGE.
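The direct-child test in the query above can be mirrored outside the database to check it against sample tree numbers. This is only a sketch in Python: the function name is_direct_child is made up for illustration, and it compares strings where the Cypher query compares nodes (equivalent only because treeNumber is unique).

```python
def is_direct_child(parent: str, child: str) -> bool:
    """Mirror of the WHERE clause: prefix match, different value, no dot in the remainder."""
    return (
        child.startswith(parent)                # c.treeNumber STARTS WITH p.treeNumber
        and child != parent                     # p <> c (string comparison stands in for node identity)
        and "." not in child[len(parent) + 1:]  # remainder after the separating dot has no further dot
    )


print(is_direct_child("A01.111", "A01.111.230"))  # direct parent
print(is_direct_child("A01", "A01.111.230"))      # grandparent, rejected by the remainder check
```

The remainder check is what separates a direct parent from any other ancestor, which a plain STARTS WITH cannot do.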
Actually, we can get rid of both the p <> c predicate and the + 1 on the length by pre-computing the actual prefix that should match:
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
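However, profiling that query shows that the index is not used and that every parent is combined with every node, as in a Cartesian product (so we have an O(n^2) algorithm):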
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 26 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) AND StartsWith(c.treeNumber,parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Apply | 2 | 9 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 9 | 9 | 12 | c | :Node |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 45
However, if we simply add a hint like this
MATCH (p:Node)
WITH p, p.treeNumber + '.' AS parentNumber
MATCH (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH parentNumber
AND NOT substring(c.treeNumber, length(parentNumber)) CONTAINS '.'
RETURN p, c
it does use the index, and we get something like an O(n * log(n)) algorithm (log(n) for the index lookup):
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 6 | c, p, parentNumber | NOT(Contains(SubstringFunction(c.treeNumber,length(parentNumber),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Apply | 2 | 3 | 0 | p, parentNumber -- c | |
| |\ +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| | +NodeUniqueIndexSeekByRange | 9 | 3 | 6 | c | :Node(treeNumber STARTS WITH parentNumber) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +Projection | 3 | 3 | 3 | parentNumber -- p | p; Add(p.treeNumber,{ AUTOSTRING0}) |
| | +----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+-------------------------------+----------------+------+---------+----------------------+------------------------------------------------------------------------------------------+
Total database accesses: 19
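To see where the log(n) factor comes from, the hinted query can be imitated with a sorted list and a binary search standing in for the index. This is a rough Python sketch under that assumption, not a model of how Neo4j implements its indexes; the function name is hypothetical.

```python
from bisect import bisect_left


def direct_children_by_prefix_seek(tree_numbers):
    """Pair each parent with its direct children using a prefix seek on a sorted list."""
    index = sorted(tree_numbers)  # stand-in for the index on treeNumber
    pairs = []
    for parent in tree_numbers:
        prefix = parent + "."  # same pre-computed prefix as in the query
        # Binary search (log n) for the first entry that is >= the prefix.
        i = bisect_left(index, prefix)
        # Entries sharing the prefix are contiguous in sorted order.
        while i < len(index) and index[i].startswith(prefix):
            child = index[i]
            if "." not in child[len(prefix):]:  # keep direct children only
                pairs.append((parent, child))
            i += 1
    return pairs


print(direct_children_by_prefix_seek(["A01.111.230", "A01.111", "A01"]))
```

Note that the range scan still visits every descendant of a parent before the filter discards the indirect ones, which matches the plan above: the index seek produces 3 rows and the Filter keeps 2.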
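Note that I cheated a bit when I introduced the WITH step that pre-computes the prefix: I had noticed that it improves the execution plan and the number of database accesses compared with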
MATCH (p:Node), (c:Node)
USING INDEX c:Node(treeNumber)
WHERE c.treeNumber STARTS WITH p.treeNumber
AND p <> c
AND NOT substring(c.treeNumber, length(p.treeNumber) + 1) CONTAINS '.'
RETURN p, c
which has the following execution plan:
Compiler CYPHER 3.0
Planner RULE
Runtime INTERPRETED
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Variables | Other |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 9 | c, p | NOT(p == c) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 6 | 12 | c -- p | PrefixSeekRangeExpression(p.treeNumber); :Node(treeNumber) |
| | +------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabel | 3 | 4 | p | :Node |
+--------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 25
Finally, for the record, the execution plan of the original query I wrote (i.e. without the hint) was:
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults | 2 | 2 | 0 | c, p | p, c |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 2 | 2 | 21 | c, p | NOT(p == c) AND StartsWith(c.treeNumber,p.treeNumber) AND NOT(Contains(SubstringFunction(c.treeNumber,Add(length(p.treeNumber),{ AUTOINT0}),None),{ AUTOSTRING1})) |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +CartesianProduct | 9 | 9 | 0 | p -- c | |
| |\ +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | +NodeByLabelScan | 3 | 9 | 12 | c | :Node |
| | +----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabelScan | 3 | 3 | 4 | p | :Node |
+--------------------+----------------+------+---------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Total database accesses: 37
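That is not the worst one: the worst is the query without the hint but with the pre-computed prefix (45 database accesses against 37)! This is why you should always measure.
Answer 1 (score: 1)
I think we can make a few improvements to the query. First, make sure you have a unique constraint or an index on :Node(treeNumber), since this query relies on it for fast parent lookups.
Next, let's match on child nodes, excluding root nodes (assuming a root's treeNumber contains no '.') and nodes that have already been processed and already have the relationship.
Then we use the index to find each node's parent by treeNumber, and create the relationship. This assumes that a child's treeNumber is always exactly 4 characters longer than its parent's, including the dot.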
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SUBSTRING(child.treeNumber, 0, SIZE(child.treeNumber)-4) as parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
I think this query avoids the Cartesian product you would get with the other answers, and should be around O(n) (someone correct me if I'm wrong).
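The idea behind that query, one pass over the children with one keyed lookup each, can be sketched in Python with a set standing in for the unique index. The function name and the segment_width parameter are illustrative assumptions, not part of the original answer.

```python
def link_fixed_width(tree_numbers, segment_width=3):
    """Map each child to its parent, assuming every segment after a dot has a fixed width."""
    known = set(tree_numbers)  # stand-in for the unique index on treeNumber
    step = segment_width + 1   # the dot plus the digits, i.e. 4 characters by default
    links = {}
    for child in tree_numbers:
        if "." not in child:   # root nodes have no parent
            continue
        parent = child[:-step]  # same slice as SUBSTRING(s, 0, SIZE(s) - 4)
        if parent in known:     # one lookup per child instead of a scan
            links[child] = parent
    return links


print(link_fixed_width(["A01", "A01.111", "A01.111.230"]))
print(link_fixed_width(["A01", "A01.111", "A01.111.23"]))
```

In the second call, A01.111.23 gets no parent: cutting 4 characters leaves A01.11, which does not exist, so the fixed-width assumption matters.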
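Edit
If each numeric subset of a treeNumber is not constrained to 3 digits (as, in fact, in your description, with 'A01.111.23'), then you need a different approach to derive the parentNumber. Neo4j is a bit weak here, as it lacks an indexOf() function as well as a join() function to reverse a split(). You will probably need to install the APOC Procedures library to get access to a join() function.
The query that handles a variable number of digits in the numeric subsets of the treeNumber becomes: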
MATCH (child:Node)
WHERE child.treeNumber CONTAINS '.'
AND NOT EXISTS( (child)-[:IS_CHILD_OF]->() )
WITH child, SPLIT(child.treeNumber, '.') as splitNumber
CALL apoc.text.join(splitNumber[0..-1], '.') YIELD value AS parentNumber
WITH child, parentNumber
MATCH (parent:Node)
WHERE parent.treeNumber = parentNumber
CREATE UNIQUE (child)-[:IS_CHILD_OF]->(parent)
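The split-then-join derivation of the parent number used in this query can be checked on sample values with a few lines of Python. This is a sketch of the string logic only; parent_number is a hypothetical helper name.

```python
def parent_number(tree_number: str) -> str:
    """Drop the last dot-separated part, whatever its length."""
    parts = tree_number.split(".")  # SPLIT(child.treeNumber, '.')
    return ".".join(parts[:-1])     # join of splitNumber[0..-1] with '.'


print(parent_number("A01.111.230"))
print(parent_number("A01.111.23"))
```

A root such as A01 yields an empty string here, which is why the query filters on CONTAINS '.' before deriving the parent.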
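Answer 2 (score: 0)
I think I just came up with a solution! (If anyone has a more elegant one, please post it.)
I just realised that the "Tree Number" coding system always uses 3 digits between dots, e.g. A01.111.230 or C02.100, so if a node is a direct child of another node, its "Tree Number" should not only start with the parent's tree number, it should also be exactly 4 characters longer (one character for the dot '.' and 3 characters for the numeric value).
So my solution, which seems to work, is: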
MATCH (n:Node), (n2:Node)
WHERE (n2.treeNumber STARTS WITH n.treeNumber)
AND (length(n2.treeNumber) = (length(n.treeNumber) + 4))
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n);
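Answer 3 (score: 0)
STARTS WITH does not do what you need here, because A01.111.23 starts with A01 as well as with A01.111.
A treeNumber consists of several parts with '.' as the separator. We make no assumptions about the minimum or maximum character length of the individual parts. What we need is to compare all the parts of each candidate parent's treeNumber with the parts of the potential child's treeNumber, minus its last part. You can achieve this with Cypher's split() function, as follows:
MATCH (n1:Node), (n2:Node)
WHERE split(n2.treeNumber,'.')[0..-1] = split(n1.treeNumber,'.')
CREATE UNIQUE (n2)-[:IS_CHILD_OF]->(n1);
The split() function splits a string into a list of strings (parts) at each occurrence of a given delimiter. In this case the delimiter is '.', which splits any treeNumber into its parts. We can select a subset of a list in Cypher with the syntax list[{startIndex}..{endIndex}]. Negative indexes are allowed for reverse lookup, such as the one used in the query above.
This solution should generalise to all possible treeNumber values in the current format, regardless of the number of parts and the length of the individual parts.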