Question

这是我第一次使用Python以字典格式从分层数据进行可视化。最后一部分数据如下：

d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}

所以我在列表上有索引，引用字典的键（索引）。我想这可以用作可视化的基础结构，如果我错了，请纠正我。数据上的字符是“结束节点/叶子”，它不会引用任何索引。

我找到了可能用于可视化的NetworkX，但我不知道从哪里开始使用它和我的数据。我希望它会像以下一样简单：

import networkx as nx
import matplotlib.pyplot as plt

d = {^2820: [^391, ^1024], ^2821: [^759, 'w', ^118, ^51], ^2822: [^291, 'o'], ^2823: [^25, ^64], ^2824: [^177, ^2459], ^2825: [^338, ^1946], ^2826: [^186, ^1511], ^2827: [^162, 'i']}

G = nx.Graph(d)
nx.draw(G)
plt.show()

我正在寻找分层树形图或某种聚类作为输出。对不起，我不完全确定什么是最好的可视化，可能与此类似：

更新

使用NetworkX实际上非常简单。我提供其他简单的样本数据并寻找答案，如果它可以通过树形图而不是有线网络图形来显示？

# original sequence: a,b,c,d,b,c,a,b,c,d,b,c
d = {^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}
G = nx.Graph(d)
nx.draw_spring(G, node_size=300, with_labels=True)

正如我们所看到的，图表显示了简单的关系，但没有我愿意做的数据的层次和顺序。 DiGraph提供了更多细节，但仍然无法从中构建原始序列：

对于树形图，显然需要按照第一个答案所指出的方式计算重量和末端节点。对于那种方法，数据结构可能是这样的：

d = {'a': [], 'b': [], 'c': [], 'd': [], ^1: ['b', 'c'], ^2: ['a', ^1, 'd', ^1], 'S': [^2, ^2]}

Answer 1

一个想法是使用SciPy's dendrogram function来绘制树形图。为此，您只需创建链接矩阵Z，这在SciPy linkage function的文档中有所描述。链接矩阵[x, y, w, z]的每一行Z描述w和x合并形成带y叶子的带根子树的权重z

为了演示，我使用带有三个叶子的小树形图创建了一个简单的例子，如下所示：

我使用以下代码创建了此可视化：

# Load required modules
import networkx as nx
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Construct the graph/hierarchy
d           = { 0: [1, 'd'], 1: ['a', 'b', 'c'], 'a': [], 'b': [], 'c': [], 'd': []}
G           = nx.DiGraph(d)
nodes       = G.nodes()
leaves      = set( n for n in nodes if G.out_degree(n) == 0 )
inner_nodes = [ n for n in nodes if G.out_degree(n) > 0 ]

# Compute the size of each subtree
subtree = dict( (n, [n]) for n in leaves )
for u in inner_nodes:
    children = set()
    node_list = list(d[u])
    while len(node_list) > 0:
        v = node_list.pop(0)
        children.add( v )
        node_list += d[v]

    subtree[u] = sorted(children & leaves)

inner_nodes.sort(key=lambda n: len(subtree[n])) # <-- order inner nodes ascending by subtree size, root is last

# Construct the linkage matrix
leaves = sorted(leaves)
index  = dict( (tuple([n]), i) for i, n in enumerate(leaves) )
Z = []
k = len(leaves)
for i, n in enumerate(inner_nodes):
    children = d[n]
    x = children[0]
    for y in children[1:]:
        z = tuple(subtree[x] + subtree[y])
        i, j = index[tuple(subtree[x])], index[tuple(subtree[y])]
        Z.append([i, j, float(len(subtree[n])), len(z)]) # <-- float is required by the dendrogram function
        index[z] = k
        subtree[z] = list(z)
        x = z
        k += 1

# Visualize
dendrogram(Z, labels=leaves)
plt.show()

有几个关键项目需要注意：

提供d数据结构，我使用NetworkX有向图（DiGraph）。方向性很重要，因此我们可以确定哪些节点是leaves（没有子节点 - ＆gt; out度为零）和inner_nodes（两个或更多个子节点 - >非零度数）。
通常，树形图中的每条边都会有一些重量，但在您的示例中没有任何权重。相反，我使用以每个内部节点n为根的子树中的叶子数作为n的权重。
如果内部节点有两个以上的子节点，则必须添加＆＃34; dummy＆＃34;内部节点，因为链接矩阵的每一行将两个节点合并在一起。这就是我写for y in children[1:]:等等的原因。

我猜你可能会在你的示例中创建d之前根据数据的样子简化此代码，因此这可能更像是一个概念验证。

如何可视化（树形图）分层项目的字典？

更新

1 个答案: