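I am having trouble figuring out how to update the networkx dag_longest_path() algorithm so that it returns "N" for ties instead of returning the first max edge found, or alternatively returns a list of all edges that are tied for max weight.
I first created a DAG from a pandas DataFrame that contains an edge list like the following subset: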
edge1 edge2 weight
115252161:T 115252162:A 1.0
115252162:A 115252163:G 1.0
115252163:G 115252164:C 3.0
115252164:C 115252165:A 5.5
115252165:A 115252166:C 5.5
115252162:T 115252163:G 1.0
115252166:C 115252167:A 7.5
115252167:A 115252168:A 7.5
115252168:A 115252169:A 6.5
115252165:A 115252166:G 0.5
I then use the following code to topologically sort the graph and find the longest path according to the weights of the edges:
G = nx.from_pandas_edgelist(edge_df, source="edge1",
                            target="edge2",
                            edge_attr=['weight'],
                            create_using=nx.OrderedDiGraph())
longest_path = pd.DataFrame(nx.dag_longest_path(G))
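This works well, except that when there is a tie for the max weighted edge it returns the first max edge found, whereas I would like it to return just an "N" representing "Null". So currently the output is: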
115252161 T
115252162 A
115252163 G
115252164 C
115252165 A
115252166 C
But what I actually need is:
115252161 T
115252162 N (or [A,T] )
115252163 G
115252164 C
115252165 A
115252166 C
The algorithm for finding the longest path is:
def dag_longest_path(G):
    dist = {}  # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, (length, _) = max(dist.items(), key=lambda x: x[1])
    path = []
    while length > 0:
        path.append(node)
        length, node = dist[node]
    return list(reversed(path))
Copy-pasteable definition of G:
import io
import pandas as pd
import networkx as nx
import numpy as np
edge_df = pd.read_csv(
    # io.StringIO replaces pd.compat.StringIO, which newer pandas no longer provides
    io.StringIO(
        """edge1 edge2 weight
115252161:T 115252162:A 1.0
115252162:A 115252163:G 1.0
115252163:G 115252164:C 3.0
115252164:C 115252165:A 5.5
115252165:A 115252166:C 5.5
115252162:T 115252163:G 1.0
115252166:C 115252167:A 7.5
115252167:A 115252168:A 7.5
115252168:A 115252169:A 6.5
115252165:A 115252166:G 0.5"""
    ),
    sep=r" +",
)
G = nx.from_pandas_edgelist(
    edge_df,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    # OrderedDiGraph was removed in networkx 3.0; nx.DiGraph() also keeps insertion order
    create_using=nx.OrderedDiGraph(),
)
longest_path = pd.DataFrame(nx.dag_longest_path(G))
Answer 0 (score: 1)
This line inside the function seems to discard the paths you want, because max only returns one result:
node, (length, _) = max(dist.items(), key=lambda x: x[1])
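I would keep the max value and then search all the items for it. Then reuse the code to find the desired paths. An example would be something like this: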
def dag_longest_path(G):
    dist = {}  # maps node -> (distance, predecessor)
    for node in nx.topological_sort(G):
        # (distance, predecessor) pairs for all incoming edges
        pairs = [(dist[v][0] + 1, v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    # store the maximum distance (not the predecessor) in max_length
    max_length = max(length for length, _ in dist.values())
    # find all nodes whose distance equals the maximum
    nodes = [(node, length) for node, (length, _) in dist.items() if length == max_length]
    paths = []
    # walk back from each tied end node and collect the paths in a list
    for node, length in nodes:
        path = []
        while length > 0:
            path.append(node)
            length, node = dist[node]
        paths.append(list(reversed(path)))
    return paths
PS. I have not tested this code to check whether it runs correctly.
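One caveat with the idea above: it only catches ties between end nodes, because max(pairs) still keeps a single predecessor whenever two incoming paths tie in the middle of the graph. A minimal sketch of a variant that remembers every tied predecessor is shown below. It is not the networkx implementation; the function name all_longest_paths is hypothetical, it works on a plain list of (source, target, weight) tuples using only the standard library (graphlib needs Python 3.9+), and it assumes a non-empty edge list with non-negative weights that can be compared exactly.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def all_longest_paths(edges):
    """Return (best_weight, paths) covering every maximum-weight path in a DAG.

    `edges` is an iterable of (source, target, weight) tuples.
    Hypothetical helper for illustration, not part of networkx.
    """
    preds = defaultdict(list)  # node -> [(predecessor, weight), ...]
    nodes = set()
    for u, v, w in edges:
        preds[v].append((u, w))
        nodes.update((u, v))

    # TopologicalSorter expects a mapping node -> iterable of predecessors.
    order = TopologicalSorter(
        {n: [u for u, _ in preds[n]] for n in nodes}
    ).static_order()

    dist = {}  # node -> (best weight of a path ending here, ALL tied predecessors)
    for node in order:
        best, tied = 0, []
        for u, w in preds[node]:
            cand = dist[u][0] + w
            if not tied or cand > best:
                best, tied = cand, [u]
            elif cand == best:  # exact comparison; assumes weights sum exactly
                tied.append(u)
        dist[node] = (best, tied)

    top = max(d for d, _ in dist.values())

    def walk_back(node):
        # Yield every path ending at `node` that achieves dist[node][0].
        tied = dist[node][1]
        if not tied:
            yield [node]
        for u in tied:
            for prefix in walk_back(u):
                yield prefix + [node]

    paths = [p for n, (d, _) in dist.items() if d == top for p in walk_back(n)]
    return top, paths


# Two equally heavy routes into "c", so both full paths are reported.
edges = [("a", "c", 1.0), ("b", "c", 1.0), ("c", "d", 2.0)]
top, paths = all_longest_paths(edges)
print(top, paths)
```

The number of tied paths can grow exponentially with the number of tie points, and walk_back recurses once per node on the path, so for long sequences it may be more practical to record only which positions have more than one tied predecessor and emit "N" there.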
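Answer 1 (score: 1)
Judging by your example, each node is determined by a position ID (the number before the :), and two nodes with different bases attached are identical for the purposes of computing the path length. If this is correct, there is no need to modify the algorithm, and you can get your result by manipulating the vertex labels.
Basically, drop everything after the colon in edge_df, compute the longest path, and append the base labels from your original data.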
edge_df_pos = pd.DataFrame(
    {
        "edge1": edge_df.edge1.str.partition(":")[0],
        "edge2": edge_df.edge2.str.partition(":")[0],
        "weight": edge_df.weight,
    }
)
vert_labels = dict()
for col in ("edge1", "edge2"):
    verts, lbls = edge_df[col].str.partition(":")[[0, 2]].values.T
    for vert, lbl in zip(verts, lbls):
        vert_labels.setdefault(vert, set()).add(lbl)
G_pos = nx.from_pandas_edgelist(
    edge_df_pos,
    source="edge1",
    target="edge2",
    edge_attr=["weight"],
    create_using=nx.OrderedDiGraph(),
)
longest_path_pos = nx.dag_longest_path(G_pos)
longest_path_df = pd.DataFrame([[node, vert_labels[node]] for node in longest_path_pos])
# 0 1
# 0 115252161 {T}
# 1 115252162 {A, T}
# 2 115252163 {G}
# 3 115252164 {C}
# 4 115252165 {A}
# 5 115252166 {G, C}
# 6 115252167 {A}
# 7 115252168 {A}
# 8 115252169 {A}
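To go from the label sets above to the "N" notation asked for in the question, something like the following sketch may be enough. The helper name collapse_labels and its tie_symbol parameter are made up for illustration, and the input is simply the second column of the table above typed out by hand. Note that under this interpretation every position with more than one label counts as a tie, regardless of edge weight, so position 115252166 also becomes "N".

```python
def collapse_labels(label_sets, tie_symbol="N"):
    """Join one symbol per position, using `tie_symbol` where several labels compete.

    Hypothetical helper: `label_sets` is a sequence of sets of base labels,
    one set per position along the longest path.
    """
    out = []
    for labels in label_sets:
        if len(labels) == 1:
            out.append(next(iter(labels)))  # the single unambiguous base
        else:
            out.append(tie_symbol)  # more than one candidate base at this position
    return "".join(out)


# Label sets copied from the output table above.
label_sets = [{"T"}, {"A", "T"}, {"G"}, {"C"}, {"A"}, {"G", "C"}, {"A"}, {"A"}, {"A"}]
print(collapse_labels(label_sets))  # TNGCANAAA
```

With the DataFrame from the answer, the same helper could presumably be applied to the label column, for example collapse_labels(longest_path_df[1]).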
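If my interpretation is not correct, I doubt that there is a simple extension of the algorithm based on topological sort. The problem is that a graph can admit multiple topological sorts. If you print dist as defined in dag_longest_path for your example, you will get something like this: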
{'115252161:T': (0, '115252161:T'),
'115252162:A': (1, '115252161:T'),
'115252162:T': (0, '115252162:T'),
'115252163:G': (2, '115252162:A'),
'115252164:C': (3, '115252163:G'),
'115252165:A': (4, '115252164:C'),
'115252166:C': (5, '115252165:A'),
'115252166:G': (5, '115252165:A'),
'115252167:A': (6, '115252166:C'),
'115252168:A': (7, '115252167:A'),
'115252169:A': (8, '115252168:A')}
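Note that '115252162:T' occurs in the third line and nowhere else. Hence, dist cannot discriminate between your example and another example where '115252162:T' occurs as a disjoint component. So it should not be possible to recover any paths through '115252162:T' using only the data in dist.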
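Answer 2 (score: 1)
I ended up just modeling the behavior in a defaultdict of Counter objects.
from collections import defaultdict, Counter
I modified my edge list to be tuples of (position, nucleotide, weight):
test = [(112, "A", 23.0), (113, "T", 27), (112, "T", 12.0), (113, "A", 27), (112, "A", 1.0)]
Then I used defaultdict(Counter) to quickly sum the weights of each nucleotide at each position:
nucs = defaultdict(Counter)
for key, nuc, weight in test:
    nucs[key][nuc] += weight
Then I looped through the dictionary to pull out all nucleotides that equal the max value:
seq_list = []  # initialised once, outside the loop, so it accumulates all positions
for key, nuc in nucs.items():
    max_nuc = []
    max_val = max(nuc.values())
    for x, y in nuc.items():
        if y == max_val:
            max_nuc.append(x)
    if len(max_nuc) != 1:
        max_nuc = "N"
    else:
        max_nuc = ''.join(max_nuc)
    seq_list.append(max_nuc)
sequence = ''.join(seq_list)
This returns the final sequence built from the nucleotide with the max value at each position, with N in the position of a tie:
TNGCACAAATGCTGAAAGCTGTACCATANCTGTCTGGTCTTGGCTGAGGTTTCAATGAATGGAATCCCGTAACTCTTGGCCAGTTCGTGGGCTTGTTTTGTATCAACTGTCCTTGTTGGCAAATCACACTTGTTTCCCACTAGCACCAT
However, this question kept bugging me, so I ended up using node attributes in networkx as a means to flag each node as being a tie or not. Now, when a node is returned in the longest path, I can check its "tie" attribute and replace the node name with "N" if it has been flagged.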
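The node-attribute idea described above can be sketched roughly as follows. This is not the answerer's code, which is not shown here. It assumes that a "tie" means two or more incoming edges of a node sharing that node's maximum incoming weight, and that the tied predecessor nodes are the ones to flag. The edge list is a subset of the question's data, and nx.DiGraph is used in place of OrderedDiGraph.

```python
import networkx as nx

# Subset of the question's edge list as (source, target, weight) tuples.
edges = [
    ("115252161:T", "115252162:A", 1.0),
    ("115252162:A", "115252163:G", 1.0),
    ("115252162:T", "115252163:G", 1.0),
    ("115252163:G", "115252164:C", 3.0),
    ("115252164:C", "115252165:A", 5.5),
    ("115252165:A", "115252166:C", 5.5),
    ("115252165:A", "115252166:G", 0.5),
]
G = nx.DiGraph()
G.add_weighted_edges_from(edges)

# Start with every node unflagged.
nx.set_node_attributes(G, False, "tie")

# Assumed tie rule: flag predecessors that share a node's maximum incoming weight.
for node in G.nodes:
    incoming = list(G.in_edges(node, data="weight"))
    if len(incoming) < 2:
        continue
    top = max(w for _, _, w in incoming)
    tied = [u for u, _, w in incoming if w == top]
    if len(tied) > 1:
        for u in tied:
            G.nodes[u]["tie"] = True

# Weighted longest path, then swap flagged nodes for "N".
path = nx.dag_longest_path(G, weight="weight")
sequence = "".join("N" if G.nodes[n]["tie"] else n.split(":")[1] for n in path)
print(sequence)  # TNGCAC
```

For this subset the result matches the output requested in the question: position 115252162 is reported as "N" because the edges from 115252162:A and 115252162:T into 115252163:G both have weight 1.0.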