Find the most commonly used sets of distinct terms

Time: 2016-09-15 19:07:13

Tags: neo4j cypher graph-databases

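Imagine a graph database made up of URLs and the tags used to describe them. From this we want to find the most commonly used sets of tags, and determine which URLs belong to each of those sets.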

I tried to create a dataset in Cypher that simplifies the problem:

CREATE (tech:Tag { name: "tech" }),
       (comp:Tag { name: "computers" }),
       (programming:Tag { name: "programming" }),
       (cat:Tag { name: "cats" }),
       (mice:Tag { name: "mice" }),
       (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
       (u1)-[:IS_ABOUT]->(comp),
       (u1)-[:IS_ABOUT]->(mice),
       (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
       (u2)-[:IS_ABOUT]->(cat),
       (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
       (u3)-[:IS_ABOUT]->(programming),
       (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
       (u4)-[:IS_ABOUT]->(mice),
       (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

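Using this as a reference (neo4j console example here), we can look at it and see that the most commonly used tags are tech and mice (the query for this is trivial), each referencing 3 URLs. The most commonly used tag pair is [tech, mice], since (in this example) it is the only pair shared by 2 URLs (u4 and u1). It is important to note that this pair is a subset of the tags on the matching URLs, not the entire tag set of either one. No combination of 3 tags is shared by any URLs.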
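To make the expected numbers concrete, here is a small Python sketch that reproduces them outside the database. It is only an illustration: the `URL_TAGS` dictionary is a hand-copied mirror of the CREATE statement, and the `combo_counts` helper is a hypothetical name, not part of Neo4j or Cypher.

```python
from collections import Counter
from itertools import combinations

# Tags per URL, copied by hand from the CREATE statement in the question.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def combo_counts(url_tags, size):
    """Count how many URLs carry each tag combination of the given size."""
    counts = Counter()
    for tags in url_tags.values():
        # Sort first so the same set of tags always yields the same tuple.
        for combo in combinations(sorted(tags), size):
            counts[combo] += 1
    return counts


single = combo_counts(URL_TAGS, 1)
pairs = combo_counts(URL_TAGS, 2)
triples = combo_counts(URL_TAGS, 3)

print(single[("tech",)], single[("mice",)])  # URLs per single tag
print(pairs.most_common(1))                  # most shared pair
print(max(triples.values()))                 # best any triple achieves
```

On this data the two single tags each cover 3 URLs, the pair (mice, tech) covers 2, and no triple covers more than 1 URL.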

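How can I write a Cypher query to determine which tag combinations are used most often (in pairs or in groups of size N)? Is there perhaps a better way to structure this data that would make the analysis easier? Or is this problem simply not well suited to a graph database? I have been struggling to figure this out, so any help or ideas would be appreciated!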

2 answers:

Answer 0: (score: 1)

This looks like a combinatorics problem.

// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U, 
     collect(distinct T) as TAGS 

// Calculate the number of tag combinations for a node,
// independent of the order of the tags (2 to the power of size(TAGS)).
// The power is computed via logarithm and exponent;
// round() guards against floating-point error in the result
//
WITH U, TAGS, 
     toInt(round(exp(log(2) * size(TAGS)))) as numberOfCombinations

// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1)
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex

// And check for each tag its presence in the combination.
// Bitwise operations are missing in Cypher,
// therefore we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex, 
     toInt(round(exp(log(2) * tagIndex))) as pw2
     call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,  
     value WHERE value > 0

// Get all combinations of tags for the URL
WITH U, TAGS, combinationIndex, 
     collect(TAGS[tagIndex]) as combination

// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls 
       ORDER BY freq DESC
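
The bitmask idea in the query above can be sketched in plain Python, which may help when checking what the Cypher should return. This is an illustration under assumptions, not a translation of the APOC call: `URL_TAGS` mirrors the question's sample data, and `tag_subsets` and `combination_frequencies` are hypothetical helper names.

```python
from collections import defaultdict

# Tags per URL, copied by hand from the question's sample data.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def tag_subsets(tags):
    """Yield every non-empty subset of tags, using one bit per tag."""
    ordered = sorted(tags)
    # Masks 1 .. 2^n - 1: bit i set means ordered[i] is in the subset.
    for mask in range(1, 1 << len(ordered)):
        yield tuple(t for i, t in enumerate(ordered) if mask & (1 << i))


def combination_frequencies(url_tags):
    """Group URLs by tag combination, most shared combinations first."""
    urls_by_combo = defaultdict(list)
    for url in sorted(url_tags):
        for combo in tag_subsets(url_tags[url]):
            urls_by_combo[combo].append(url)
    # Sort by descending URL count, then by combination for stable output.
    return sorted(urls_by_combo.items(), key=lambda kv: (-len(kv[1]), kv[0]))


ranked = combination_frequencies(URL_TAGS)
for combo, urls in ranked[:3]:
    print(combo, len(urls), urls)
```

Note that the number of subsets doubles with every extra tag on a URL, so this enumeration (like the query) is only practical when URLs carry a small number of tags.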

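I think it is best to use this algorithm to compute and store the tag combinations at tagging time. The query would then look something like this: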

MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
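
One way to picture "compute and store at tagging time" is the following in-memory Python sketch. It is a hypothetical stand-in for the stored `TagsCombination` nodes, not Neo4j code: the class `TagCombinationIndex` and its methods are made-up names, and a real implementation would write nodes and relationships instead of filling dictionaries.

```python
from collections import defaultdict
from itertools import combinations


class TagCombinationIndex:
    """Hypothetical in-memory stand-in for stored TagsCombination nodes."""

    def __init__(self):
        self.tags_by_url = defaultdict(set)
        self.urls_by_combo = defaultdict(set)

    def add_tag(self, url, tag):
        """Record a new tag on a URL and register the combinations it creates."""
        existing = self.tags_by_url[url]
        if tag in existing:
            return
        # Each new combination is the new tag plus some subset of the
        # tags already on the URL (including the empty subset).
        for size in range(len(existing) + 1):
            for subset in combinations(sorted(existing), size):
                combo = tuple(sorted(subset + (tag,)))
                self.urls_by_combo[combo].add(url)
        existing.add(tag)

    def most_common(self, min_size=1):
        """Return (combination, urls) rows, most shared combinations first."""
        rows = [(combo, sorted(urls))
                for combo, urls in self.urls_by_combo.items()
                if len(combo) >= min_size]
        return sorted(rows, key=lambda row: (-len(row[1]), row[0]))


index = TagCombinationIndex()
for tag in ("tech", "mice", "computers"):
    index.add_tag("http://u1.com", tag)
for tag in ("tech", "mice", "accessories"):
    index.add_tag("http://u4.com", tag)

print(index.most_common(min_size=2)[0])
```

The trade-off is the usual one: each tagging operation does more work and storage grows with the number of combinations, but the frequency query becomes a simple lookup.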

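Answer 1: (score: 0)

Starting from the URL nodes, build a tuple of tag.name values (sorting them first so that identical sets group together). This gives you every tag combination that actually exists. Then use a filter to find how many URLs each of those tag sets matches.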

MATCH (u:Url)
WITH u
MATCH (u)-[:IS_ABOUT]->(t:Tag)
WITH u, t
ORDER BY t.name
WITH u, [x IN COLLECT(t) | x.name] AS tags
WITH DISTINCT tags
MATCH (u:Url)
// tags holds tag names, so match each name against a Tag node
WHERE ALL(tag IN tags WHERE (u)-[:IS_ABOUT]->(:Tag { name: tag }))
RETURN tags, count(u)
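
For comparison, the same two steps (collect the distinct sorted tag sets, then count the URLs that contain each set) can be sketched in Python. This is an illustration only: `URL_TAGS` mirrors the question's sample data and `full_set_counts` is a hypothetical helper name.

```python
# Tags per URL, copied by hand from the question's sample data.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def full_set_counts(url_tags):
    """For each distinct complete tag set, count the URLs that contain it."""
    # Step 1: the distinct tag sets, sorted so equal sets compare equal.
    distinct_sets = {tuple(sorted(tags)) for tags in url_tags.values()}
    # Step 2: a URL matches when it carries every tag in the set.
    return {
        tag_set: sum(1 for tags in url_tags.values() if set(tag_set) <= tags)
        for tag_set in distinct_sets
    }


counts = full_set_counts(URL_TAGS)
for tag_set in sorted(counts):
    print(tag_set, counts[tag_set])
```

One thing the sketch makes visible: only sets that are the complete tag set of at least one URL are considered, so on the sample data the pair (mice, tech) never appears as a candidate, because no URL is tagged with exactly those two tags.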