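Analyzing a large graph - retrieving clusters and calculating the strongest path

Asked: 2010-11-08 13:51:57

Tags: sql graph-theory social-networking

I have tried to use a breadth-first algorithm to retrieve the whole connected cluster that a selected user belongs to, in the following way: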

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not interfere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
            DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to DiscoveredUser, negative - from DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE DiscoveredUser = @EndNode)
                    BREAK   

            -- UNION removes identical rows, but a node reachable over several edges is still
            -- inserted once per distinct predecessor; group by the new node to insert it only once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d.DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d.DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n.DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(n.DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d.DiscoveredUser = n.DiscoveredUser
            WHERE d.DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n.DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                -- Cast to varchar(50), the declared width of the id, so long ids are not truncated.
                CAST(cte.Path + ',' + CAST(n.DiscoveredUser AS varchar(50)) AS varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte.DiscoveredUser
            JOIN users n ON d.DiscoveredUser = n.DiscoveredUser
        )

        SELECT Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END

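The graph I am currently analyzing (a social network) has 28M connections. Without ignoring weak connections (the threshold is set with @LinkStrength), the execution runs for a very long time: so far I have not gotten any result, and I will try leaving it running overnight.

Anyway, the next step is to calculate the shortest (strongest) link between two users (there are around 3M users). I was thinking about using Dijkstra's algorithm, but I am not sure whether it is possible to analyze such a network on the PC I am currently using (quad-core CPU at 2.66 GHz, 4 GB RAM), with the data stored in an MS SQL Server 2008 database.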

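To summarize, I would appreciate answers or suggestions on the following questions:

  1. Is it possible to execute a query as complex as the one above on the described graph (28M connections, 3M users) on the described machine (2.66 GHz, 4 GB RAM)?
  2. If it is not possible, are there other approaches that could reduce the execution time (e.g. creating tables with partial results)?
  3. Would you recommend any other algorithms for detecting clusters and calculating shortest paths in the described graph?

Thanks!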

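3 Answers:

Answer 0: (score: 1)

First, use indexes.

Second, you need to reduce your memory requirements. That means first giving each VARCHAR(50) a short alias, such as an int, which is 4 bytes instead of 50. That could get you a speedup on the order of 10x.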

declare @tmpPeople table(
  ixPerson int identity primary key,
  UserNodeID varchar(50),
  unique(UserNodeID, ixPerson) -- this creates an index
)
insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
  ixParent int,
  ixChild int,
  unique(ixParent, ixChild),
  unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

You need to write a query that fits your specific need, rather than a "generic" one.
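For the cluster retrieval in particular, the integer-key idea can also be tried outside the database. The following is only a minimal sketch in Python, not the answerer's method: it assumes the edges have been exported as pairs of string ids and that they fit in memory, and the names `intern_ids`, `find` and `cluster_of` are hypothetical. It maps each id to a small integer and merges connected users with a union-find structure.

```python
def intern_ids(edges):
    """Map each string user id to a small integer, in first-seen order."""
    ids = {}
    int_edges = []
    for a, b in edges:
        ia = ids.setdefault(a, len(ids))
        ib = ids.setdefault(b, len(ids))
        int_edges.append((ia, ib))
    return ids, int_edges


def find(parent, x):
    """Return the representative of x's set, shortening the chain as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def cluster_of(edges, start):
    """Return the set of user ids connected to `start` (direction is ignored)."""
    ids, int_edges = intern_ids(edges)
    parent = list(range(len(ids)))
    for a, b in int_edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters
    if start not in ids:
        return {start}  # a user with no connections is its own cluster
    root = find(parent, ids[start])
    return {name for name, i in ids.items() if find(parent, i) == root}
```

A call such as `cluster_of([("alice", "bob"), ("bob", "carol")], "alice")` returns all three users. Filtering out weak links (the `@LinkStrength` threshold in the question) would happen before the edges are passed in.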

If you want to find the shortest distance between two nodes, you can cut the search time by searching from both ends at once.

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
-- (the inner alias must differ from the outer p1, otherwise it shadows it)
where not exists (
    select 1 from @p1 seen
    where seen.ixParent = p1.ixChild and seen.ixChild = r.ixChild
)

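For "distance from Alice to Bob", you can run the two searches in parallel and stop when Alice's search contains anyone who is also in Bob's search. That also cuts your time by a factor of about n^2, where n is the average number of connections.

Don't forget to stop if the depth gets too large.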
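The two-ended search with a depth cut-off can be sketched outside SQL as well. This is a hedged illustration in Python, not a translation of the table-variable code: `bidirectional_distance` and `max_depth` are hypothetical names, and `adj` is assumed to be an in-memory dict that lists every connection in both directions.

```python
def bidirectional_distance(adj, source, target, max_depth=6):
    """Hop count between source and target, or None if none is found within max_depth hops."""
    if source == target:
        return 0
    # dist_a / dist_b: hops from source / target for every node seen so far.
    dist_a, dist_b = {source: 0}, {target: 0}
    frontier_a, frontier_b = [source], [target]
    for _ in range(max_depth):
        if not frontier_a or not frontier_b:
            return None  # one side is exhausted: the nodes are not connected
        # Expand the smaller frontier to keep both searches balanced.
        expand_a = len(frontier_a) <= len(frontier_b)
        if expand_a:
            frontier, dist, other = frontier_a, dist_a, dist_b
        else:
            frontier, dist, other = frontier_b, dist_b, dist_a
        best = None
        next_frontier = []
        for node in frontier:
            for neighbour in adj.get(node, ()):
                if neighbour in other:
                    # The two searches touch: record the total path length.
                    hops = dist[node] + 1 + other[neighbour]
                    best = hops if best is None else min(best, hops)
                elif neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    next_frontier.append(neighbour)
        if best is not None:
            return best  # shortest meeting found on this level
        if expand_a:
            frontier_a = next_frontier
        else:
            frontier_b = next_frontier
    return None  # depth limit reached
```

Finishing the whole level before returning, rather than stopping at the first contact, is what keeps the reported distance minimal.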

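Answer 1: (score: 0)

If you want an exact answer, it does not matter whether you search breadth-first or depth-first. An exact answer requires an exhaustive search, which will be slow.

As fmark suggested, heuristics can help you find a likely best solution with a reasonable degree of certainty. That will save you a lot of time, but it will not be exact.

You have to choose between speed and exactness; you cannot really have both. It is like image compression for photos (or sound, or video): a lossless format reproduces photos of most natural scenes exactly but does not compress them much, while JPEG compresses very well but with some loss.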

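Edit 1: The only thing I can think of that might help you with an exact search is the mathematical theory of sparse matrices. Your problem is similar to raising the social relationship strength matrix to a series of different powers (power n = number of steps between persons A and B) and finding which cells have the highest value for each (A, B) pair. That is what your query is doing, except that a database query may not be the fastest way to do it.

I cannot help you much with that, though. You may want to look at the Wikipedia article on sparse matrices.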
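To make the matrix-power picture concrete, here is a small sketch in Python under a loudly stated assumption: the strength of a multi-step path is taken to be the product of its link strengths, which the question does not specify. The matrix is stored sparsely as a dict of dicts, and `max_times_product` and `strongest_by_steps` are hypothetical names. Ordinary matrix multiplication would sum over intermediate nodes; this variant takes the maximum instead.

```python
def max_times_product(a, b):
    """result[i][j] = max over k of a[i][k] * b[k][j]; only nonzero cells are stored."""
    result = {}
    for i, row in a.items():
        out = {}
        for k, a_ik in row.items():
            for j, b_kj in b.get(k, {}).items():
                value = a_ik * b_kj
                if value > out.get(j, 0.0):
                    out[j] = value
        if out:
            result[i] = out
    return result


def strongest_by_steps(strength, max_steps):
    """Yield (n, matrix), where matrix[i][j] is the strongest n-step walk from i to j."""
    power = strength
    for n in range(1, max_steps + 1):
        yield n, power
        if n < max_steps:
            power = max_times_product(power, strength)
```

Taking, for each (A, B) pair, the largest cell value across the yielded matrices gives the strongest connection of at most `max_steps` steps. On a graph with millions of nodes the higher powers fill in quickly, so this is meant to illustrate the idea rather than to scale as written.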

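Edit 2: I just thought of one more thing. I do not see how you could prune branches that you know for certain will be weak using a SQL query, whereas with a purpose-built algorithm working on a sparse matrix it should be easy to prune the branches you know can be eliminated, based on your strength model.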
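One possible shape for such a purpose-built search with pruning is sketched below in Python. It rests on assumptions that do not come from the thread: link strengths lie in (0, 1], and the strength of a path is the product of its links, so extending a path can never make it stronger. Under that model a Dijkstra-style search can always expand the strongest known node first and drop any branch that falls below a threshold. `strongest_path` and `min_strength` are hypothetical names.

```python
import heapq


def strongest_path(adj, source, target, min_strength=0.0):
    """Return (strength, path) for the strongest path, or None if none survives pruning."""
    best = {source: 1.0}     # strongest strength found so far for each node
    prev = {}                # predecessor on that strongest path
    heap = [(-1.0, source)]  # max-heap emulated by negating strengths
    done = set()
    while heap:
        neg_strength, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            break
        strength = -neg_strength
        for neighbour, weight in adj.get(node, {}).items():
            candidate = strength * weight
            if candidate < min_strength:
                continue  # prune: this branch can only get weaker
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (-candidate, neighbour))
    if target not in done:
        return None
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return best[target], path
```

If the strength model is different, for example the weakest link on the path, the pruning argument still holds but the `candidate` computation would need to change accordingly.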

Answer 2: (score: 0)

Migrating to a graph database first, before doing the analysis, might help. I have not used one personally, but neo4j was recommended to me.

HTH