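My data has the following format:
Every second, for N seconds, I write M strings to a list [M(i), i = {1,..,N}, is not necessarily equal to M(j), j = {1,..,N | j != i}]. I do this in 3 instances. That is, every second I create 3 lists containing an arbitrary number of strings, for a total of N seconds.
Now I want to show visually how many strings the lists have in common at each second (possibly as a correlation or similarity matrix). I want to repeat this for all N seconds. I'm not sure how to do it.
Suppose N = 3: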
# instance 1
I1 = [['cat', 'dog', 'bob'],  # 1st second
      ['eel', 'pug', 'emu'],  # 2nd second
      ['owl', 'yak', 'elk']]  # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'],  # 1st second
      ['emu', 'pug', 'ram'],  # 2nd second
      ['bug', 'bee', 'bob']]  # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'],  # 1st second
      ['emu', 'pug', 'eel'],  # 2nd second
      ['bob', 'bee', 'yak']]  # 3rd second
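What is the best way in Python to visualize the number of common elements across the instances for each second? P.S. I can already plot this as a graph, but I'm interested in creating a correlation or similarity matrix.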
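Answer 0 (score: 1)
You can iterate over the instances, build your own similarity matrix, and plot it with matplotlib's imshow function. With this approach the matrix holds the total similarity summed over all seconds; for per-second similarity you would need a 3-dimensional similarity matrix. That is doable with the code further down, but you would need to find a way to visualize it other than a single imshow call.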
import numpy as np
import matplotlib.pyplot as plt

# instance 1
I1 = [['cat', 'dog', 'bob'],  # 1st second
      ['eel', 'pug', 'emu'],  # 2nd second
      ['owl', 'yak', 'elk']]  # 3rd second
# instance 2
I2 = [['dog', 'fox', 'rat'],  # 1st second
      ['emu', 'pug', 'ram'],  # 2nd second
      ['bug', 'bee', 'bob']]  # 3rd second
# instance 3
I3 = [['cat', 'bob', 'fox'],  # 1st second
      ['emu', 'pug', 'eel'],  # 2nd second
      ['bob', 'bee', 'yak']]  # 3rd second

total = [I1, I2, I3]
# initialize similarity matrix by number of instances you have
sim_matrix = np.zeros(shape=(len(total), len(total)))
# constant per your explanation
N = 3
# for each row in sim matrix
for i in range(len(total)):
    # for each column in sim matrix
    for j in range(len(total)):
        # if comparing itself
        if i == j:
            # similarity is total # of strings across all seconds (may not be constant)
            sim_matrix[i, j] = sum(len(t) for t in total[i])
        else:
            # sum up each set intersection of each list of strings at each second
            sim_matrix[i, j] = sum(len(set(total[i][s]) & set(total[j][s])) for s in range(N))
sim_matrix
The result should be
array([[9., 3., 6.],
[3., 9., 5.],
[6., 5., 9.]])
You can visualize it with imshow
plt.imshow(sim_matrix)
plt.colorbar()
plt.show()
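A bare heatmap does not say which row belongs to which instance. As a minimal sketch, the same matrix can be drawn with tick labels and the count written into each cell. The `labels` list and the colorbar caption are illustrative names, not something from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Total similarity matrix from the example above.
sim_matrix = np.array([[9, 3, 6],
                       [3, 9, 5],
                       [6, 5, 9]])
labels = ["I1", "I2", "I3"]  # hypothetical display names for the instances

fig, ax = plt.subplots()
im = ax.imshow(sim_matrix, cmap="viridis")
fig.colorbar(im, ax=ax, label="shared strings")

# Name the rows and columns after the instances.
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write the count into each cell (x is the column, y is the row).
for i in range(sim_matrix.shape[0]):
    for j in range(sim_matrix.shape[1]):
        ax.text(j, i, str(sim_matrix[i, j]),
                ha="center", va="center", color="white")
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig(...)` to write it to a file.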
There are almost certainly better and more efficient ways to do this, but if the number of lists is small, this is probably fine.
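One possible tightening, offered as a sketch rather than the definitive way: convert every per-second list to a set once, and fill only the upper triangle since the matrix is symmetric. The function name `total_similarity` is made up for this example. Note one assumption that differs from the loop above: the diagonal here counts distinct strings per second, so it would be smaller if a list contained duplicates.

```python
from itertools import combinations


def total_similarity(instances):
    """Return a K x K nested list of shared-string counts summed over all seconds.

    `instances` is a list of K instances, each a list of per-second string lists.
    Assumes every instance covers the same number of seconds.
    """
    # Build each set once instead of on every comparison.
    as_sets = [[set(second) for second in inst] for inst in instances]
    k = len(as_sets)
    matrix = [[0] * k for _ in range(k)]

    # Diagonal: distinct strings per second, summed over the seconds.
    for i, inst in enumerate(as_sets):
        matrix[i][i] = sum(len(s) for s in inst)

    # Off-diagonal: the matrix is symmetric, so compute each pair once.
    for i, j in combinations(range(k), 2):
        shared = sum(len(a & b) for a, b in zip(as_sets[i], as_sets[j]))
        matrix[i][j] = matrix[j][i] = shared
    return matrix


I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]

result = total_similarity([I1, I2, I3])
print(result)  # [[9, 3, 6], [3, 9, 5], [6, 5, 9]]
```

The nested list can be passed straight to `plt.imshow` or wrapped in `np.array` if array operations are needed.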
If you need a similarity matrix for each second, you can use the following modified code
# axis 0 is the second, axes 1 and 2 are the two instances being compared
sim_matrix = np.zeros(shape=(N, len(total), len(total)))
for i in range(len(total)):
    for j in range(len(total)):
        if i == j:
            sim_matrix[:, i, j] = [len(t) for t in total[i]]
        else:
            sim_matrix[:, i, j] = [len(set(total[i][s]) & set(total[j][s])) for s in range(N)]
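You can still pass the 3-d similarity matrix to imshow, but it will interpret the array as an RGB image, treating the last axis as the color channels, rather than drawing one heatmap per second.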
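One way around the RGB interpretation, as a minimal sketch, is to draw one heatmap per second side by side with a shared color scale. The variable names `sim`, `n_seconds`, and `n_inst` are chosen for this example. It assumes all instances cover the same number of seconds, and the diagonal here counts distinct strings because it is computed as a set intersection of a list with itself.

```python
import numpy as np
import matplotlib.pyplot as plt

I1 = [['cat', 'dog', 'bob'], ['eel', 'pug', 'emu'], ['owl', 'yak', 'elk']]
I2 = [['dog', 'fox', 'rat'], ['emu', 'pug', 'ram'], ['bug', 'bee', 'bob']]
I3 = [['cat', 'bob', 'fox'], ['emu', 'pug', 'eel'], ['bob', 'bee', 'yak']]
total = [I1, I2, I3]

n_inst = len(total)
n_seconds = len(total[0])

# Axis 0 indexes the second; axes 1 and 2 index the pair of instances.
sim = np.zeros((n_seconds, n_inst, n_inst), dtype=int)
for s in range(n_seconds):
    for i in range(n_inst):
        for j in range(n_inst):
            sim[s, i, j] = len(set(total[i][s]) & set(total[j][s]))

# One panel per second; vmin/vmax keep the colors comparable across panels.
fig, axes = plt.subplots(1, n_seconds, figsize=(4 * n_seconds, 4), squeeze=False)
for s, ax in enumerate(axes[0]):
    im = ax.imshow(sim[s], vmin=0, vmax=sim.max())
    ax.set_title(f"second {s + 1}")
fig.colorbar(im, ax=axes.ravel().tolist(), label="shared strings")
```

Call `plt.show()` to display the figure. Summing `sim` over axis 0 gives back the total similarity matrix from the first approach.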