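From a node list

Time: 2017-05-19 19:30:53

Tags: python pandas nodes networkx edge-list

I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as Gephi, Cytoscape, socnet, and NodeXL to visualize and identify edges and communities, but the node list is too large for these tools. So I am trying to write a script to determine the edges and communities instead. The other columns are the connection start datetime and end datetime, along with the GPS location.

Input:

ID,start time,end time,GPS1,GPS2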

0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00904b14b494,1073260804,1073265163,817558,439525
00904b14b494,1073260804,1073263786,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d1406df,1073260807,1073260878,820428,438735
00022d623dfe,1073260810,1073276346,819251,440006
00022d7317d7,1073260810,1073276155,819251,440006
00022d9064bc,1073260810,1073272525,819251,440006
00022d9064bc,1073260810,1073260999,819251,440006
00022d9064bc,1073260810,1073260857,819251,440006
0030650c9eda,1073260811,1073260813,820356,439224
00022d0e0cec,1073260813,1073262843,820187,439271
00022d176cf3,1073260813,1073260962,817721,439564
000c30d8d2e8,1073260813,1073260902,817721,439564
00904b243bc4,1073260813,1073260962,817721,439564
00904b2fc34d,1073260813,1073260962,817721,439564
00904b52b839,1073260813,1073260962,817721,439564
00904b9a5a51,1073260813,1073260962,817721,439564
00904ba8b682,1073260813,1073260962,817721,439564
00022d3be9cd,1073260815,1073261114,819269,439403
00022d80381f,1073260815,1073261114,819269,439403
00022dc1b09c,1073260815,1073261114,819269,439403
00022d36a6df,1073260817,1073260836,820761,438607
00022d36a6df,1073260817,1073260845,820761,438607
003065d2d8b6,1073260817,1073267560,817735,439757
00904b0c7856,1073260817,1073265149,817735,439757
00022de73863,1073260825,1073260879,817558,439525
00904b14b494,1073260825,1073260879,817558,439525
00904b312d9e,1073260825,1073260879,817558,439525
00022d15b1c7,1073260826,1073260966,820353,439280
00022dcbe817,1073260826,1073260966,820353,439280

I am trying to build an undirected weighted/unweighted graph.

1 answer:

Answer 0 (score: 7)

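Use Pandas to import the data and reshape it into a list of node pairs, where each row represents one edge according to your edge criteria. Then move to a networkx object for graph analysis.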
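
As a minimal sketch of the import step, the snippet below reads a few rows from the question into a DataFrame. The column names ID, start, end, gps1 and gps2 follow the df.head() output shown in this answer. Reading from an in-memory string and forcing ID to str are assumptions, since the answer does not show how df was loaded.

```python
import io
import pandas as pd

# A few rows from the question, inlined so the example is self-contained.
raw = """0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
"""

cols = ["ID", "start", "end", "gps1", "gps2"]
# dtype={"ID": str} keeps the IDs as strings, so an ID that happens to look
# numeric would not lose its leading zeros.
df = pd.read_csv(io.StringIO(raw), header=None, names=cols, dtype={"ID": str})
```

For the real dataset, the io.StringIO(raw) argument would be replaced by the path to the input file.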

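Criteria for two nodes to share an edge:

  1. Same location. I assume this means that gps1 and gps2 are both the same.
  2. "Close to the same start and end time." This is a bit ambiguous. For the purposes of this answer, I have reduced this criterion to "start times within the same 5-second interval". If you want to apply additional time conditions to an edge, it should not be too hard to extend the groupby approach I take here.

Since we want to manipulate the data by timestamp, convert start and end to datetime dtype: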

    import pandas as pd

    # Epoch seconds -> datetime64, so rows can be grouped by time interval
    df.start = pd.to_datetime(df.start, unit="s")
    df.end = pd.to_datetime(df.end, unit="s")
    
    df.start.describe()
    count                      35
    unique                     11
    top       2004-01-05 00:00:13
    freq                        8
    first     2004-01-05 00:00:01
    last      2004-01-05 00:00:26
    Name: start, dtype: object
    
    df.head()
                 ID               start                 end    gps1    gps2
    0   0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03  819251  440006
    1  00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10  819213  439954
    2  00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40  817526  439458
    3  00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50  817558  439525
    4  00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25  817558  439525
    

The sample observations occur within seconds of one another, so we set the grouping frequency to just a few seconds:

    near = "5s" 
    

Now groupby location and start time to find connected nodes:

    # One group per (location, 5-second start window); each group is one edge category
    edges = (df.groupby(["gps1",
                         "gps2",
                         pd.Grouper(key="start",
                                    freq=near,
                                    closed="right",
                                    label="right")],
                        as_index=False)
               .agg({"ID": ",".join,
                     "start": "min",
                     "end": "max"})
               .reset_index()
               .rename(columns={"index": "edge",
                                "start": "start_min",
                                "end": "end_max"})
            )
    
    edges.ID = edges.ID.str.split(",")
    

    edges.head()

       edge    gps1    gps2                                                 ID  \
    0     0  817526  439458                                     [00904b4557d3]   
    1     1  817558  439525  [00022de73863, 00904b14b494, 00904b14b494, 009...   
    2     2  817558  439525         [00022de73863, 00904b14b494, 00904b312d9e]   
    3     3  817721  439564  [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...   
    4     4  817735  439757                       [003065d2d8b6, 00904b0c7856]   
    
                start_min             end_max  
    0 2004-01-05 00:00:03 2004-01-05 00:18:40  
    1 2004-01-05 00:00:04 2004-01-05 01:16:50  
    2 2004-01-05 00:00:25 2004-01-05 00:01:19  
    3 2004-01-05 00:00:13 2004-01-05 00:02:42  
    4 2004-01-05 00:00:17 2004-01-05 01:52:40 
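
To see how the 5-second window decides group membership, here is a small pure-Python sketch of the binning that closed="right" and label="right" are expected to produce. Each start time falls in a window (t - 5, t] and is labelled by the window's right edge, which amounts to rounding the epoch second up to a multiple of 5. The helper name window_label is hypothetical, and the sketch assumes the windows are aligned to multiples of 5 seconds.

```python
def window_label(epoch_s, width=5):
    """Right edge of the (label - width, label] window that contains epoch_s."""
    return -(-epoch_s // width) * width  # ceiling division, then scale back

# Start times (epoch seconds) from the question's input, all at location 817558,439525
starts = [1073260804, 1073260804, 1073260825]
labels = [window_label(s) for s in starts]
```

The first two start times share a window while the third lands in a later one, which is consistent with rows 1 and 2 of edges.head() having the same location but being separate edge categories.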
    

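Each row now represents a unique edge category, and ID is the list of all nodes that share that edge. Getting this list into a new structure of node pairs is a bit tricky; I resorted to some old-fashioned nested for loops. There is probably some Pandas-fu that would make this more efficient:

Note: for singleton nodes, I assigned a None value as the partner. If you don't want to keep track of singletons, just leave out the if not len(combos): ... logic.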

    from itertools import combinations

    pairs = []
    for e in edges.edge.values:
        # IDs grouped under this edge category, and the attributes they share
        nodes = edges.loc[edges.edge==e, "ID"].values[0]
        attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
        combos = list(combinations(nodes, 2))
        if not len(combos):
            # singleton node: record it with None as its partner
            pair = [e, nodes[0], None]
            pair.extend(attrs.values[0])
            pairs.append(pair)
        else:
            for combo in combos:
                pair = [e, combo[0], combo[1]]
                pair.extend(attrs.values[0])
                pairs.append(pair)
    cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
    pairs_df = pd.DataFrame(pairs, columns=cols)
    

    pairs_df.head()

       edge         nodeA         nodeB    gps1    gps2           start_min  \
    0     0  00904b4557d3          None  817526  439458 2004-01-05 00:00:03   
    1     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
    2     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
    3     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
    4     1  00904b14b494  00904b14b494  817558  439525 2004-01-05 00:00:04   
    
                  end_max  
    0 2004-01-05 00:18:40  
    1 2004-01-05 01:16:50  
    2 2004-01-05 01:16:50  
    3 2004-01-05 01:16:50  
    4 2004-01-05 01:16:50      
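
The output above contains repeated pairs and a self-pair (row 4) because the same ID can appear several times within one group. If that is not wanted, one possible variation, sketched here on plain lists, is to de-duplicate and sort the IDs before pairing them. The function name unique_pairs is hypothetical.

```python
from itertools import combinations

def unique_pairs(nodes):
    """Return each unordered pair of distinct IDs once; a lone ID pairs with None."""
    distinct = sorted(set(nodes))
    if len(distinct) == 1:
        return [(distinct[0], None)]
    return list(combinations(distinct, 2))

# IDs of edge category 1 from the edges output, including the repeats
group = ["00022de73863", "00904b14b494", "00904b14b494", "00904b14b494"]
pairs_for_group = unique_pairs(group)
```

Sorting also gives every pair a consistent orientation, so the same two nodes never show up as both (A, B) and (B, A) across groups.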
    

Now the data can be fit into a networkx object:

    import networkx as nx
    
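    # Newer networkx releases replace from_pandas_dataframe with
    # nx.from_pandas_edgelist, which takes the same arguments as used here.
    g = nx.from_pandas_dataframe(pairs_df, "nodeA", "nodeB", edge_attr=True)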
    
    # access edge attributes by node pairing:
    test_A = "00022de73863"
    test_B = "00904b14b494"
    g[test_A][test_B]["start_min"]
    # output:
    Timestamp('2004-01-05 00:00:25')
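
Since the question asks for a weighted or unweighted graph, here is one possible sketch of deriving a weight, assuming the weight of a pair is the number of edge categories the two nodes share. It uses nx.from_pandas_edgelist on a small made-up pairs table, and it keeps singleton rows as isolated nodes rather than as edges to None. The table contents and the weight definition are illustrative assumptions, not part of the original answer.

```python
import networkx as nx
import pandas as pd

# Made-up node pairs in the same shape as pairs_df (only the columns needed here)
pairs_df = pd.DataFrame({
    "edge":  [0, 1, 2, 3],
    "nodeA": ["a", "a", "a", "c"],
    "nodeB": ["b", "b", "d", None],
})

# Separate real pairs from singletons (rows whose nodeB is missing)
paired = pairs_df[pairs_df["nodeB"].notna()]
singles = pairs_df.loc[pairs_df["nodeB"].isna(), "nodeA"]

# Weight = number of edge categories in which the pair appears together
weighted = (paired.groupby(["nodeA", "nodeB"])
                  .size()
                  .reset_index(name="weight"))

g = nx.from_pandas_edgelist(weighted, "nodeA", "nodeB", edge_attr="weight")
g.add_nodes_from(singles)  # keep singletons as isolated nodes
```

This counts (a, b) and (b, a) separately, so it relies on each pair being written in a consistent order. Dropping the weight column gives the unweighted version of the same graph.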
    

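For community detection, there are several options. Consider the networkx community algorithms as well as the community module, which builds on native networkx functionality.

I read the question as being mainly about manipulating the data into a format suitable for network analysis. Since this answer is already long enough, I'll leave it to you to pursue a community detection strategy; several approaches work out of the box with the modules I've linked here.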
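
As one example of the out-of-the-box options mentioned above, the sketch below runs greedy_modularity_communities from networkx.algorithms.community on a small made-up graph. The choice of algorithm and the toy graph are assumptions for illustration; the answer does not recommend a specific method.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two tightly connected groups joined by a single bridge edge
g = nx.Graph()
g.add_edges_from([("a", "b"), ("a", "c"), ("b", "c"),   # first triangle
                  ("d", "e"), ("d", "f"), ("e", "f"),   # second triangle
                  ("c", "d")])                          # bridge

# Each community is a frozenset of node IDs
communities = greedy_modularity_communities(g)

# Map every node to the index of the community it was assigned to
node_to_community = {n: i for i, comm in enumerate(communities) for n in comm}
```

The node_to_community mapping can then be joined back onto the original table by ID to inspect which devices were grouped together.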