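I am extracting cell data from a pd.DataFrame wherever the value is above a certain threshold. I store the index, column header and value in a tuple, and append these tuples to a list. The layout of the DataFrame I take values from means that I extract every element twice, but I only need to store each combination once. From reading previous people's efforts, set(list) should give these unique elements, but on a mock dataset that should produce the single result ('Pathway1', 'Pathway2', 0.6), it reports both permutations.
Does anyone know why set isn't working in this instance? I am aware the entries need to be identical, and to my eyes they are (even down to the type of each tuple component: string, string, float). Out of desperation, I tried coercing the float to a string, with no improvement.
For completeness, most of the code is given (simplified a little). The block at the bottom is where the problem arises. The code is as follows: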
#Import modules
import numpy as np
import pandas as pd
#Define trial sets
s1 = ["A", "B", "C", "D", "E"]
s2 = ["A", "B", "C"]
s3 = ["A", "B", "F"]
s4 = ["A", "B", "G", "H", "I"]
s5 = ["X", "Y", "Z"]
slist = [s1,s2,s3,s4,s5]
#Create an empty list to append results to
result1 = []
#Calculate Jaccard index between every entry
#This is computationally inefficient as most computations are performed twice to generate a full results matrix to make mapping easy. Making half a matrix is more complicated but would be possible within the loop. Empty values would still have to be coded for though so in terms of storage of the final results matrix I don't think there should be much difference
for i in range(len(slist)):
    for j in range(len(slist)):
        result1.append(len(set(slist[i]).intersection(slist[j]))/len(set(slist[i]).union(slist[j])))
#Define result matrix dimensions
shape = (len(slist), len(slist))
#Convert list to array for numpy
rarray = np.array(result1)
#Reshape array into a square results matrix
rmatrix = rarray.reshape(shape)
pathway_names = ["Pathway1", "Pathway2", "Pathway3", "Pathway4", "Pathway5"]
dataframe = pd.DataFrame(data = rmatrix, index = pathway_names, columns = pathway_names)
#List all pathways with Jaccard index > x unless PathwayName = PathwayName
x = 0.5
temp =[] #A temporary list for holding lists of tuples which will contain permutations
The problem arises here:
for k in range(len(slist)):
    index = dataframe.index[dataframe.iloc[k]>x]
    for l in range(len(index)):
        if index[l] != dataframe.columns[k]:
            temp.append((index[l], dataframe.columns[k], dataframe.iloc[l,k]))
print(set(temp))
The output I get from printing set(temp) is:
{('Pathway1', 'Pathway2', 0.6), ('Pathway2', 'Pathway1', 0.6)}
but I require (in any order):
('Pathway1', 'Pathway2', 0.6)
Thanks for any help you can provide,
Angus
Answer 0 (score: 0)
The problem is that tuples are ordered, so ('Pathway1', 'Pathway2', 0.6) is not equal to ('Pathway2', 'Pathway1', 0.6).
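A quick check makes the point concrete. The snippet below is only an illustration using the two tuples from the question; the variable names are made up for the example.

```python
a = ("Pathway1", "Pathway2", 0.6)
b = ("Pathway2", "Pathway1", 0.6)

# Tuples compare element by element, so the order of the names matters.
tuples_equal = a == b

# Because the tuples are not equal, a set keeps both of them.
distinct_count = len({a, b})

# A frozenset ignores order, so the two name pairs compare equal.
names_equal = frozenset(a[:2]) == frozenset(b[:2])

print(tuples_equal, distinct_count, names_equal)  # False 2 True
```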
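To fix this, initialize temp as a set and sort the two names in each tuple before adding it. Sorting the whole tuple would raise a TypeError in Python 3, because strings and floats cannot be compared with each other.
temp = set()
for ...:
    ...
    the_tuple = ...
    #Sort only the two names; the value stays in last position
    temp.add(tuple(sorted(the_tuple[:2])) + the_tuple[2:])
print(temp)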
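For reference, here is one way this could look end to end on the mock data from the question. It is a sketch rather than a drop-in replacement: the helper jaccard and the name pairs are introduced here for illustration, and values are looked up by label with .loc instead of by position, since a position in the filtered index does not generally match a row position in the DataFrame.

```python
import numpy as np
import pandas as pd

# Mock sets and names taken from the question.
slist = [
    ["A", "B", "C", "D", "E"],
    ["A", "B", "C"],
    ["A", "B", "F"],
    ["A", "B", "G", "H", "I"],
    ["X", "Y", "Z"],
]
pathway_names = ["Pathway1", "Pathway2", "Pathway3", "Pathway4", "Pathway5"]


def jaccard(a, b):
    """Jaccard index of two collections (illustrative helper, not from the question)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Full symmetric matrix, as in the question.
matrix = np.array([[jaccard(a, b) for b in slist] for a in slist])
dataframe = pd.DataFrame(matrix, index=pathway_names, columns=pathway_names)

x = 0.5
pairs = set()
for col in dataframe.columns:
    column = dataframe[col]
    for row in column.index[column > x]:
        if row != col:
            # Sort only the names: sorted() cannot compare str with float in Python 3.
            first, second = sorted((row, col))
            pairs.add((first, second, float(dataframe.loc[row, col])))

print(pairs)  # {('Pathway1', 'Pathway2', 0.6)}
```

Because both orientations of a pair normalize to the same tuple, the set keeps a single entry per combination.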