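How do I create an edge list DataFrame from an adjacency matrix in Python?

Asked: 2018-01-12 01:50:02

Tags: python pandas numpy dataframe

I have a pandas DataFrame (think of it as a weighted adjacency matrix of the nodes in a network), df: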

    A    B    C    D
A   0    0.5  0.5  0
B   1    0    0    0
C   0.8  0    0    0.2
D   0    0    1    0

I want to obtain a DataFrame that represents the edge list. For the example above, I need something of the form edge_list_df:

    Source    Target    Weight    
0   A           B        0.5 
1   A           C        0.5
2   A           D        0
3   B           A        1
4   B           C        0
5   B           D        0
6   C           A        0.8
7   C           B        0
8   C           D        0.2
9   D           A        0
10  D           B        0
11  D           C        1

What is the most efficient way to create this?

4 answers:

Answer 0 (score: 7)

Mark the diagonal as NaN, then stack:

# Index with a tuple: newer NumPy no longer treats a list of arrays as a multidimensional index
df.values[tuple([np.arange(len(df))] * 2)] = np.nan
df
Out[172]: 
     A    B    C    D
A  NaN  0.5  0.5  0.0
B  1.0  NaN  0.0  0.0
C  0.8  0.0  NaN  0.2
D  0.0  0.0  1.0  NaN

df.stack().reset_index()
Out[173]: 
   level_0 level_1    0
0        A       B  0.5
1        A       C  0.5
2        A       D  0.0
3        B       A  1.0
4        B       C  0.0
5        B       D  0.0
6        C       A  0.8
7        C       B  0.0
8        C       D  0.2
9        D       A  0.0
10       D       B  0.0
11       D       C  1.0
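
The in-place assignment above relies on df.values being a writable view of the frame, which may not hold for mixed dtypes or when copy-on-write is enabled. A minimal sketch of the same idea that leaves df untouched is given below. It assumes the sample frame from the question; the names off_diag and edge_list_df are illustrative only. The explicit dropna() is there because, depending on the pandas version, stack() may or may not drop NaN rows by itself. Note that it would also drop any genuine NaN weights off the diagonal.

```python
import numpy as np
import pandas as pd

# Sample adjacency matrix from the question.
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=list("ABCD"),
    columns=list("ABCD"),
)

# Boolean mask that is True everywhere except on the diagonal.
off_diag = ~np.eye(len(df), dtype=bool)

edge_list_df = (
    df.where(off_diag)   # diagonal becomes NaN; df itself is not modified
      .stack()           # long format indexed by (row label, column label)
      .dropna()          # drop the diagonal explicitly, whatever stack() did
      .rename_axis(["Source", "Target"])
      .reset_index(name="Weight")
)
print(edge_list_df)
```

This yields the Source / Target / Weight column names directly, so no renaming of level_0, level_1 and 0 is needed afterwards.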

Answer 1 (score: 6)

Use rename_axis + reset_index + melt:

df.rename_axis('Source')\
  .reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

  Source Target  Weight
0       B      A     1.0
1       C      A     0.8
2       D      A     0.0
3       A      B     0.5
4       C      B     0.0
5       D      B     0.0
6       A      C     0.5
7       B      C     0.0
8       D      C     1.0
9       A      D     0.0
10      B      D     0.0
11      C      D     0.2

melt was introduced as a DataFrame method in 0.20; for older versions, you need pd.melt instead:

v = df.rename_axis('Source').reset_index()
df = pd.melt(
      v, 
      id_vars='Source', 
      value_name='Weight', 
      var_name='Target'
).query('Source != Target')\
 .reset_index(drop=True)

Timings

x = np.random.randn(1000, 1000)
x[tuple([np.arange(len(x))] * 2)] = 0  # zero the diagonal; the index must be a tuple on newer NumPy

df = pd.DataFrame(x)

%%timeit
df.index.name = 'Source'
df.reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)

1 loop, best of 3: 139 ms per loop

# Wen's solution

%%timeit
df.values[tuple([np.arange(len(df))]*2)] = np.nan  # tuple index, see the note in Answer 0
df.stack().reset_index()

10 loops, best of 3: 45 ms per loop

Answer 2 (score: 4)

Two approaches using NumPy tools -

Approach #1

def edgelist(df):
    a = df.values
    c = df.columns
    n = len(c)

    c_ar = np.array(c)
    out = np.empty((n, n, 2), dtype=c_ar.dtype)

    out[...,0] = c_ar[:,None]
    out[...,1] = c_ar

    mask = ~np.eye(n,dtype=bool)
    # Flat list of names; a nested list would build MultiIndex columns in current pandas
    df_out = pd.DataFrame(out[mask], columns=['Source','Target'])
    df_out['Weight'] = a[mask]
    return df_out

Sample run -

In [155]: df
Out[155]: 
     A    B    C    D
A  0.0  0.5  0.5  0.0
B  1.0  0.0  0.0  0.0
C  0.8  0.0  0.0  0.2
D  0.0  0.0  1.0  0.0

In [156]: edgelist(df)
Out[156]: 
   Source Target  Weight
0       A      B     0.5
1       A      C     0.5
2       A      D     0.0
3       B      A     1.0
4       B      C     0.0
5       B      D     0.0
6       C      A     0.8
7       C      B     0.0
8       C      D     0.2
9       D      A     0.0
10      D      B     0.0
11      D      C     1.0

Approach #2

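# https://stackoverflow.com/a/46736275/ @Divakar
def skip_diag_strided(A):
    # View of a C-contiguous square array with the diagonal skipped, shape (m-1, m)
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1))

# onecold is called below but is not defined in this post. This helper is a
# reconstruction in the same strided style (an assumption, not quoted code).
# Row i holds `a` rotated left by i+1 positions.
def onecold(a):
    n = len(a)
    s = a.strides[0]
    strided = np.lib.stride_tricks.as_strided
    b = np.concatenate((a,a[:-1]))
    return strided(b[1:], shape=(n-1,n), strides=(s,s))

# https://stackoverflow.com/a/48234170/ @Divakar
def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))
    out.shape = (n-1,n,2)
    out[:,:,1] = onecold(a)
    out.shape = (-1,2)
    return out  

# Fixed-width str array; 'S1' would give bytes labels on Python 3
cols = df.columns.to_numpy().astype(str)
df_out = pd.DataFrame(combinations_without_repeat(cols), columns=['Source','Target'])
df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()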

Runtime test

Using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup -

In [704]: x = np.random.randn(1000, 1000)
     ...: x[[np.arange(len(x))] * 2] = 0
     ...: 
     ...: df = pd.DataFrame(x)

# @cᴏʟᴅsᴘᴇᴇᴅ's soln
In [705]: %%timeit
     ...: df.index.name = 'Source'
     ...: df.reset_index()\
     ...:   .melt('Source', value_name='Weight', var_name='Target')\
     ...:   .query('Source != Target')\
     ...:   .reset_index(drop=True)
10 loops, best of 3: 67.4 ms per loop

# @Wen's soln
In [706]: %%timeit
     ...: df.values[[np.arange(len(df))]*2] = np.nan
     ...: df.stack().reset_index()
100 loops, best of 3: 19.6 ms per loop

# Proposed in this post - Approach #1
In [707]: %timeit edgelist(df)
10 loops, best of 3: 24.8 ms per loop

# Proposed in this post - Approach #2
In [708]: %%timeit
     ...: cols = df.columns.values.astype('S1')
     ...: df_out = pd.DataFrame(combinations_without_repeat(cols))
     ...: df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
100 loops, best of 3: 17.4 ms per loop

Answer 3 (score: 0)

Using the NetworkX 2.x API
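
The code for this answer is not preserved here, so the following is only a sketch of how the NetworkX 2.x route might look, not the answerer's original code. It assumes the sample frame from the question and uses nx.from_pandas_adjacency and nx.to_pandas_edgelist. One difference from the desired output: NetworkX treats zero entries of the adjacency matrix as "no edge", so zero-weight rows do not appear in the result.

```python
import networkx as nx
import pandas as pd

# Sample adjacency matrix from the question.
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=list("ABCD"),
    columns=list("ABCD"),
)

# Directed graph, so that A->B and B->A keep their own weights.
G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)

# Edge attributes become columns; the weight column is named "weight".
edge_list_df = nx.to_pandas_edgelist(G, source="Source", target="Target")
edge_list_df = edge_list_df.rename(columns={"weight": "Weight"})
print(edge_list_df)
```

If the zero-weight pairs are required, one of the pandas or NumPy approaches above is likely a better fit.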
