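Imagine a graph database made up of URLs and the tags used to describe them. From this we want to find which sets of tags are most frequently used together, and determine which URLs belong to each identified set.
I've tried to create a dataset that simplifies this problem, in Cypher: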
CREATE (tech:Tag { name: "tech" }),
  (comp:Tag { name: "computers" }),
  (programming:Tag { name: "programming" }),
  (cat:Tag { name: "cats" }),
  (mice:Tag { name: "mice" }),
  (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
  (u1)-[:IS_ABOUT]->(comp),
  (u1)-[:IS_ABOUT]->(mice),
  (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
  (u2)-[:IS_ABOUT]->(cat),
  (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
  (u3)-[:IS_ABOUT]->(programming),
  (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
  (u4)-[:IS_ABOUT]->(mice),
  (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })
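Using this as a reference (neo4j console example here), we can look at it and visually determine that the most commonly used tags are tech and mice (the query for this is trivial), both referenced by 3 URLs. The most commonly used tag pair is [tech, mice], as it is (in this example) the only pair shared by 2 URLs (u4 and u1). It is important to note that this tag pair is a subset of each matching URL's tags, not the entire tag set of either. No combination of 3 tags is shared by more than one URL.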
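To make the expected results concrete, here is a minimal Python sketch that mirrors the sample data outside the database and counts tag combinations of a fixed size. The names `URL_TAGS` and `combination_counts` are made up for this illustration; it only checks the numbers stated above and is not a replacement for a Cypher query.

```python
from collections import Counter
from itertools import combinations

# Sample data mirroring the CREATE statement above (URL -> set of tag names).
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def combination_counts(url_tags, size):
    """Count how many URLs carry each tag combination of the given size."""
    counts = Counter()
    for tags in url_tags.values():
        # Sort first so the same set of tags always yields the same tuple key.
        for combo in combinations(sorted(tags), size):
            counts[combo] += 1
    return counts


singles = combination_counts(URL_TAGS, 1)
pairs = combination_counts(URL_TAGS, 2)
triples = combination_counts(URL_TAGS, 3)

print(singles[("tech",)], singles[("mice",)])  # both tags are used by 3 URLs
print(pairs.most_common(1))                    # the single most shared pair
print(max(triples.values()))                   # no triple is shared by 2 URLs
```

Sorting the tags before building each tuple is what lets [tech, mice] and [mice, tech] count as the same combination.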
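How can I write a Cypher query to determine which tag combinations are used together most frequently (in pairs or in groups of size N)? Perhaps there is a better way to structure this data that would make the analysis easier? Or is this problem not well suited to a graph DB? I've been struggling to figure this out, and any help or ideas would be appreciated!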
Answer 0 (score: 1)
This looks like a combinatorics problem.
// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U,
collect(distinct T) as TAGS
// Calculate the number of tag combinations for a node,
// independent of the order of the tags.
// Cypher's ^ operator returns a float, so the result
// is converted to an integer.
//
WITH U, TAGS,
toInteger(2 ^ size(TAGS)) as numberOfCombinations
// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1)
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex
// And check for each tag whether it is present in the combination.
// Bitwise operations are missing in Cypher,
// therefore we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex,
toInteger(2 ^ tagIndex) as pw2
call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,
value WHERE value > 0
// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex,
collect(TAGS[tagIndex]) as combination
// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls
ORDER BY freq DESC
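The bitmask enumeration used by the query can be easier to follow outside Cypher. Below is a minimal Python sketch of the same idea, run on the sample data from the question: bit i of a mask decides whether tag i belongs to the combination. The names `URL_TAGS`, `tag_combinations` and `combination_frequencies` are assumptions made for this illustration.

```python
from collections import defaultdict

# Sample data from the question (URL -> set of tag names).
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def tag_combinations(tags):
    """Yield every non-empty subset of tags; bit i of the mask selects tag i."""
    ordered = sorted(tags)  # fixed order, so equal subsets give equal tuples
    for mask in range(1, 1 << len(ordered)):  # masks 1 .. 2^n - 1
        yield tuple(tag for i, tag in enumerate(ordered) if mask & (1 << i))


def combination_frequencies(url_tags):
    """Return (combination, urls) pairs, most frequently used first."""
    urls_by_combo = defaultdict(list)
    for url, tags in sorted(url_tags.items()):
        for combo in tag_combinations(tags):
            urls_by_combo[combo].append(url)
    # Ties are broken by the combination itself to keep the order stable.
    return sorted(urls_by_combo.items(), key=lambda item: (-len(item[1]), item[0]))


ranked = combination_frequencies(URL_TAGS)
# Keep only combinations of two or more tags.
multi = [(combo, urls) for combo, urls in ranked if len(combo) > 1]
print(multi[0])
```

Each URL with n tags contributes 2^n - 1 combinations, so this approach is only practical while the number of tags per URL stays small.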
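I think it is best to use this algorithm to compute and store the tag combinations at tagging time. The query would then look something like this: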
MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
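As a rough illustration of what "compute and store at tagging time" could mean, here is a minimal in-memory Python sketch. The `CombinationIndex` class is hypothetical and only stands in for the stored TagsCombination nodes; it is not part of Neo4j or APOC. When a URL gets a new tag, the only new combinations are that tag joined with each subset of the URL's existing tags.

```python
from collections import defaultdict
from itertools import combinations


class CombinationIndex:
    """Hypothetical in-memory stand-in for stored TagsCombination nodes."""

    def __init__(self):
        self.tags_by_url = defaultdict(set)
        self.urls_by_combo = defaultdict(set)

    def add_tag(self, url, tag):
        existing = self.tags_by_url[url]
        if tag in existing:
            return  # nothing new to record
        # New combinations = the new tag plus every subset of the existing tags.
        for size in range(len(existing) + 1):
            for subset in combinations(sorted(existing), size):
                self.urls_by_combo[frozenset(subset) | {tag}].add(url)
        existing.add(tag)

    def most_common(self, min_size=1):
        """Return (tags, urls) rows, most frequently used first."""
        rows = [
            (sorted(combo), sorted(urls))
            for combo, urls in self.urls_by_combo.items()
            if len(combo) >= min_size
        ]
        return sorted(rows, key=lambda row: (-len(row[1]), row[0]))


index = CombinationIndex()
taggings = [
    ("http://u1.com", "tech"), ("http://u1.com", "computers"), ("http://u1.com", "mice"),
    ("http://u2.com", "mice"), ("http://u2.com", "cats"),
    ("http://u3.com", "tech"), ("http://u3.com", "programming"),
    ("http://u4.com", "tech"), ("http://u4.com", "mice"), ("http://u4.com", "accessories"),
]
for url, tag in taggings:
    index.add_tag(url, tag)

print(index.most_common(min_size=2)[0])
```

The trade-off is write-time cost and storage: adding the n-th tag to a URL records 2^(n-1) new combinations, in exchange for a cheap read query.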
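Answer 1 (score: 0)
Starting from the URL nodes, build a tuple of tag.name values (sorting them first, so that identical sets group together). That gives you every tag combination that actually exists on some URL. Then use a filter to find out how many URLs match each of those tag sets.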
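// Labels are case-sensitive, so use Url and Tag as in the sample data
MATCH (u:Url)-[:IS_ABOUT]->(t:Tag)
WITH u, t
ORDER BY t.name
WITH u, collect(t.name) AS tags
WITH DISTINCT tags
MATCH (u:Url)
// tags holds names, so match each one against a Tag node by name
WHERE ALL(tag IN tags WHERE (u)-[:IS_ABOUT]->(:Tag { name: tag }))
RETURN tags, count(u)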
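To see what this approach yields on the sample data, here is a minimal Python sketch of the same logic. The names `URL_TAGS` and `count_supersets` are assumptions for this illustration. The candidates are only the complete tag sets that some URL actually carries, and each one is counted against every URL whose tags contain it.

```python
# Sample data from the question (URL -> set of tag names).
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def count_supersets(url_tags):
    """For each distinct complete tag set, count the URLs whose tags contain it."""
    # Sorted tuples make equal tag sets compare equal, like ORDER BY t.name.
    candidates = {tuple(sorted(tags)) for tags in url_tags.values()}
    return {
        candidate: sum(1 for tags in url_tags.values() if set(candidate) <= tags)
        for candidate in candidates
    }


counts = count_supersets(URL_TAGS)
for tags, n in sorted(counts.items()):
    print(tags, n)
```

On this data every candidate matches only its own URL. The pair [tech, mice] never shows up as a candidate, because no URL has exactly those two tags, so this approach only finds shared sets that are also some URL's complete tag set.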