I have a pandas DataFrame, df (think of it as a weighted adjacency matrix of the nodes in a network):
     A    B    C    D
A    0  0.5  0.5    0
B    1    0    0    0
C  0.8    0    0  0.2
D    0    0    1    0
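For reproducibility, the frame above can be built along these lines. This is only a sketch: the constructor call is an assumption, and only the labels and values are taken from the table.

```python
import pandas as pd

# Node labels are used for both the rows (sources) and the columns (targets).
labels = ["A", "B", "C", "D"]

df = pd.DataFrame(
    [
        [0, 0.5, 0.5, 0],
        [1, 0, 0, 0],
        [0.8, 0, 0, 0.2],
        [0, 0, 1, 0],
    ],
    index=labels,
    columns=labels,
)
print(df)
```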
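I want to obtain a DataFrame that represents the edge list. For the example above, I need something of the form edge_list_df:
   Source Target  Weight
0       A      B     0.5
1       A      C     0.5
2       A      D       0
3       B      A       1
4       B      C       0
5       B      D       0
6       C      A     0.8
7       C      B       0
8       C      D     0.2
9       D      A       0
10      D      B       0
11      D      C       1
What is the most efficient way to create this?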
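Answer 0 (score: 7)
Set the diagonal to NaN, then stack. When this was written, stack dropped NaN entries by default, so the diagonal disappears from the result:
np.fill_diagonal(df.values, np.nan)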
df
Out[172]:
     A    B    C    D
A  NaN  0.5  0.5  0.0
B  1.0  NaN  0.0  0.0
C  0.8  0.0  NaN  0.2
D  0.0  0.0  1.0  NaN
df.stack().reset_index()
Out[173]:
   level_0 level_1    0
0        A       B  0.5
1        A       C  0.5
2        A       D  0.0
3        B       A  1.0
4        B       C  0.0
5        B       D  0.0
6        C       A  0.8
7        C       B  0.0
8        C       D  0.2
9        D       A  0.0
10       D       B  0.0
11       D       C  1.0
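If you would rather not modify df in place, and want the Source/Target/Weight column names from the question, something along these lines should work. It is only a sketch: the helper name `to_edge_list` is made up for illustration, and masking with `DataFrame.where` plus an explicit `dropna` is swapped in for the in-place assignment above, so the result does not depend on whether `stack` drops NaN in your pandas version.

```python
import numpy as np
import pandas as pd


def to_edge_list(df):
    """Hypothetical helper: adjacency-matrix DataFrame -> edge-list DataFrame."""
    off_diag = ~np.eye(len(df), dtype=bool)  # True everywhere except the diagonal
    masked = df.where(off_diag)              # diagonal becomes NaN; df is left untouched
    # Long format. dropna() removes the masked diagonal (and, as a side effect,
    # any weight that was already NaN in the input).
    edges = masked.stack().dropna()
    return edges.rename_axis(["Source", "Target"]).reset_index(name="Weight")


labels = list("ABCD")
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=labels,
    columns=labels,
)
edge_list_df = to_edge_list(df)
print(edge_list_df)
```

Zero-weight edges are kept on purpose, since the desired output in the question lists them.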
Answer 1 (score: 6)
Use rename_axis + reset_index + melt:
df.rename_axis('Source')\
  .reset_index()\
  .melt('Source', value_name='Weight', var_name='Target')\
  .query('Source != Target')\
  .reset_index(drop=True)
   Source Target  Weight
0       B      A     1.0
1       C      A     0.8
2       D      A     0.0
3       A      B     0.5
4       C      B     0.0
5       D      B     0.0
6       A      C     0.5
7       B      C     0.0
8       D      C     1.0
9       A      D     0.0
10      B      D     0.0
11      C      D     0.2
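Note that melt walks the frame column by column, so the rows above come out grouped by Target rather than by Source. If the ordering from the question matters, a sort at the end is one way to get it. This is a sketch that assumes the labels sort in the order you want (true for A to D here); the sample frame is rebuilt only so the snippet runs on its own.

```python
import pandas as pd

labels = list("ABCD")
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=labels,
    columns=labels,
)

edge_list_df = (
    df.rename_axis("Source")
      .reset_index()
      .melt(id_vars="Source", value_name="Weight", var_name="Target")
      .query("Source != Target")           # drop the diagonal (self-loops)
      .sort_values(["Source", "Target"])   # reorder by source, then target
      .reset_index(drop=True)
)
print(edge_list_df)
```

The extra sort costs time, so skip it if the row order is irrelevant.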
melt was introduced as a DataFrame method in 0.20; on older versions you need pd.melt instead:
v = df.rename_axis('Source').reset_index()
df = pd.melt(
    v,
    id_vars='Source',
    value_name='Weight',
    var_name='Target'
).query('Source != Target')\
 .reset_index(drop=True)
Timings
x = np.random.randn(1000, 1000)
x[[np.arange(len(x))] * 2] = 0
df = pd.DataFrame(x)
%%timeit
df.index.name = 'Source'
df.reset_index()\
.melt('Source', value_name='Weight', var_name='Target')\
.query('Source != Target')\
.reset_index(drop=True)
1 loop, best of 3: 139 ms per loop
# Wen's solution
%%timeit
df.values[[np.arange(len(df))]*2] = np.nan
df.stack().reset_index()
10 loops, best of 3: 45 ms per loop
Answer 2 (score: 4)
Two approaches using NumPy tools -
Approach #1
def edgelist(df):
    a = df.values
    c = df.columns
    n = len(c)

    # Build an (n, n, 2) grid of (source, target) label pairs
    c_ar = np.array(c)
    out = np.empty((n, n, 2), dtype=c_ar.dtype)
    out[...,0] = c_ar[:,None]
    out[...,1] = c_ar

    # Keep only the off-diagonal entries
    mask = ~np.eye(n,dtype=bool)
    df_out = pd.DataFrame(out[mask], columns=['Source','Target'])
    df_out['Weight'] = a[mask]
    return df_out
Sample run -
In [155]: df
Out[155]:
     A    B    C    D
A  0.0  0.5  0.5  0.0
B  1.0  0.0  0.0  0.0
C  0.8  0.0  0.0  0.2
D  0.0  0.0  1.0  0.0
In [156]: edgelist(df)
Out[156]:
   Source Target  Weight
0       A      B     0.5
1       A      C     0.5
2       A      D     0.0
3       B      A     1.0
4       B      C     0.0
5       B      D     0.0
6       C      A     0.8
7       C      B     0.0
8       C      D     0.2
9       D      A     0.0
10      D      B     0.0
11      D      C     1.0
Approach #2
# https://stackoverflow.com/a/46736275/ @Divakar
def skip_diag_strided(A):
    m = A.shape[0]
    strided = np.lib.stride_tricks.as_strided
    s0,s1 = A.strides
    # Start one element in and step by a full row plus one, which skips the diagonal
    return strided(A.ravel()[1:], shape=(m-1,m), strides=(s0+s1,s1))

# https://stackoverflow.com/a/48234170/ @Divakar
def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))
    out.shape = (n-1,n,2)
    out[:,:,1] = onecold(a)  # onecold is not defined in this post
    out.shape = (-1,2)
    return out

# Note: 'S1' yields bytes labels on Python 3; use 'U1' to get str labels
cols = df.columns.values.astype('S1')
df_out = pd.DataFrame(combinations_without_repeat(cols))
df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
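The `onecold` helper called in `combinations_without_repeat` is not defined in this post. Judging only from how it is used, it has to return an (n-1, n) array whose row-major flattening lists, for each element in turn, all the other elements. Below is a sketch of one helper that satisfies that. It uses plain modular indexing rather than strides, and it is an assumption, not necessarily the definition from the linked post. The pair-building function is repeated (with `reshape` instead of assigning to `.shape`) so the block runs on its own.

```python
import numpy as np


def onecold(a):
    """Assumed helper (not from the source): entry [r, c] is a[(r + c + 1) % n]."""
    n = len(a)
    idx = (np.arange(n - 1)[:, None] + np.arange(n) + 1) % n
    return a[idx]


def combinations_without_repeat(a):
    """All ordered pairs (x, y) of elements of a with x != y, as an (n*(n-1), 2) array."""
    n = len(a)
    out = np.empty((n, n - 1, 2), dtype=a.dtype)
    out[:, :, 0] = a[:, None]          # each element repeated n-1 times as the source
    view = out.reshape(n - 1, n, 2)    # same memory, regrouped into rows of length n
    view[:, :, 1] = onecold(a)         # fill in the matching targets
    return out.reshape(-1, 2)


pairs = combinations_without_repeat(np.array(list("ABCD")))
print(pairs)
```

The strided version in the linked answers presumably avoids building the index array, so treat this as a readable stand-in rather than the fastest option.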
In [704]: x = np.random.randn(1000, 1000)
...: x[[np.arange(len(x))] * 2] = 0
...:
...: df = pd.DataFrame(x)
# @cᴏʟᴅsᴘᴇᴇᴅ's soln
In [705]: %%timeit
...: df.index.name = 'Source'
...: df.reset_index()\
...: .melt('Source', value_name='Weight', var_name='Target')\
...: .query('Source != Target')\
...: .reset_index(drop=True)
10 loops, best of 3: 67.4 ms per loop
# @Wen's soln
In [706]: %%timeit
...: df.values[[np.arange(len(df))]*2] = np.nan
...: df.stack().reset_index()
100 loops, best of 3: 19.6 ms per loop
# Proposed in this post - Approach #1
In [707]: %timeit edgelist(df)
10 loops, best of 3: 24.8 ms per loop
# Proposed in this post - Approach #2
In [708]: %%timeit
...: cols = df.columns.values.astype('S1')
...: df_out = pd.DataFrame(combinations_without_repeat(cols))
...: df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
100 loops, best of 3: 17.4 ms per loop