Question

I want to read a matrix file that looks like this:

sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1

I want to fetch all pairs whose value is >= 0.8, for example sample1,sample3 0.8, sample2,sample3 0.8, and so on, from a large file.

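When I use csv.reader, each line is turned into a list, and keeping track of the row and column names makes the program messy. I would like to know an elegant way of doing this, for example with numpy or pandas.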

Desired output:

sample1,sample3 0.8 
sample2,sample3 0.8

The 1s can be ignored, because the value between a sample and itself is always 1.

Answer 1

You can use np.triu to zero out everything except the values strictly above the diagonal:

In [11]: df
Out[11]:
         sample1  sample2  sample3
sample
sample1      1.0      0.7      0.8
sample2      0.7      1.0      0.8
sample3      0.8      0.8      1.0

In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. ,  0.7,  0.8],
       [ 0. ,  0. ,  0.8],
       [ 0. ,  0. ,  0. ]])

In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False,  True],
       [False, False,  True],
       [False, False, False]], dtype=bool)

Then, to extract the index/columns where it is True, I think you have to use np.where*:

In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))

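This gives you an array of the row (index) positions first and then an array of the column positions (this is the least efficient part of the numpy version):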

In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)

In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
 ('sample2', 'sample3', 0.80000000000000004)]

as desired.

*I may be forgetting an easier way to get this last chunk (Edit: the pandas code below does it, but I think there may be another way too.)

You can use the same trick in pandas, but with stack to get the index/columns natively:

In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
         sample1  sample2  sample3
sample
sample1        0        0      0.8
sample2        0        0      0.8
sample3        0        0      0.0

In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()

In [23]: res
Out[23]:
sample
sample1  sample1    0.0
         sample2    0.0
         sample3    0.8
sample2  sample1    0.0
         sample2    0.0
         sample3    0.8
sample3  sample1    0.0
         sample2    0.0
         sample3    0.0
dtype: float64

In [24]: res[res!=0]
Out[24]:
sample
sample1  sample3    0.8
sample2  sample3    0.8
dtype: float64
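The steps above can be collected into one small function. This is only a sketch under a few assumptions: the file is whitespace-separated as in the question (read here from an in-memory string instead of a real file), and the helper name pairs_at_least is made up for illustration. It builds a boolean mask of the strict upper triangle instead of calling np.triu on the values, so that zeros introduced by the masking can never pass a low threshold.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the matrix file from the question (assumed whitespace-separated).
TEXT = """\
sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1
"""


def pairs_at_least(df, threshold=0.8):
    """Return (row, column, value) for entries above the diagonal >= threshold."""
    values = df.to_numpy()
    # True only strictly above the diagonal, so each pair is visited once.
    upper = np.triu(np.ones(values.shape, dtype=bool), k=1)
    rows, cols = np.nonzero(upper & (values >= threshold))
    return [(df.index[i], df.columns[j], float(values[i, j]))
            for i, j in zip(rows, cols)]


df = pd.read_csv(io.StringIO(TEXT), sep=r"\s+", index_col=0)
for row, col, value in pairs_at_least(df):
    print(f"{row},{col} {value}")
```

For a real file, the io.StringIO(TEXT) argument would be replaced by the file path.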

Answer 2

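If you want to use Pandas, the following answer should help. I am assuming you will figure out for yourself how to read the matrix file into Pandas. I am also assuming that your columns and rows are labelled properly. After reading the data you end up with a DataFrame that looks a lot like the matrix at the top of your question. I am assuming that all the row names are the DataFrame index. My starting point is that you have already read the data into a variable called df.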

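Pandas generally handles column-wise operations more efficiently than row-wise ones, so we loop over the columns rather than over the rows.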

pairs = {}
for col in df.columns:
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist()
    # If row names are not an index, but a different column named 'names' run the following line, instead of the line above
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()

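Alternatively, you can do this with apply(), since it also iterates over all the columns. (Maybe in 0.17 it will release the GIL for faster results; I don't know, because I haven't tried it.)

pairs will now contain the column names as keys and, as values, the lists of row names whose correlation is at least 0.8 but less than 1.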

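If you also want to extract the correlation values from the DataFrame, replace .tolist() with .to_dict(). Since .to_dict() is a Series method, call it on the filtered column rather than on its .index. .to_dict() produces a dict whose keys are the index labels and whose values are the values, {index -> value}, so in the end your pairs will look like {column -> {index -> value}}. It will also be free of NaN. Note that .to_dict() only gives the desired row names if your index contains them; otherwise it returns the default index, which is just numbers.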
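A minimal sketch of the .to_dict() variant described above, assuming the row names are already the DataFrame index. The small DataFrame is built inline only to keep the example self-contained, and the filtered column is selected with df.loc[mask, col] so that .to_dict() is called on a Series.

```python
import pandas as pd

labels = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8],
     [0.7, 1.0, 0.8],
     [0.8, 0.8, 1.0]],
    index=labels,
    columns=labels,
)

pairs = {}
for col in df.columns:
    # Same filter as above: at least 0.8 but below 1 (drops the diagonal).
    mask = (df[col] >= 0.8) & (df[col] < 1)
    # Select the column itself so the values are kept next to the row names.
    pairs[col] = df.loc[mask, col].to_dict()

print(pairs)
```

Because the matrix is symmetric, each pair shows up twice in this structure, once under each of its two column names.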

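P.S. If your file is large, I suggest reading it in chunks. In that case the code above is repeated for each chunk, so it should sit inside the loop that iterates over the chunks. However, you have to be careful to append the new data from the next chunk to pairs. The following links are for your reference: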

Pandas I/O docs
Pandas read_csv() function
SO question on chunked read

You may also want to read reference 1 for the other types of I/O supported by Pandas.
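The chunked approach from the P.S. might look roughly like the sketch below. It assumes a whitespace-separated file whose first column holds the row names; an in-memory string and a deliberately tiny chunksize of 2 stand in for a large file. Each chunk holds a few rows but all columns, so the per-column results are extended rather than overwritten.

```python
import io

import pandas as pd

# Stand-in for a large matrix file (assumed whitespace-separated).
TEXT = """\
sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1
"""

pairs = {}
# chunksize makes read_csv return an iterator of DataFrames instead of one DataFrame.
reader = pd.read_csv(io.StringIO(TEXT), sep=r"\s+", index_col=0, chunksize=2)
for chunk in reader:
    for col in chunk.columns:
        mask = (chunk[col] >= 0.8) & (chunk[col] < 1)
        # Append to whatever earlier chunks already contributed for this column.
        pairs.setdefault(col, []).extend(chunk[mask].index.tolist())

print(pairs)
```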

Answer 3

To read it, you need the skipinitialspace and index_col parameters:

a=pd.read_csv('yourfile.txt',sep=' ',skipinitialspace=True,index_col=0)

To get the values pairwise:

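# Only look at columns to the right of the diagonal, so each pair appears once
# and no hard-coded slice of the result is needed.
[[x, y, round(a.loc[x, y], 3)] for i, x in enumerate(a.index) for y in a.columns[i + 1:] if a.loc[x, y] >= 0.8]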

which gives:

[['sample1', 'sample3', 0.8], 
['sample2', 'sample3', 0.8]]

Answer 4

Use scipy.sparse.coo_matrix, since it works with the "(row, col) data" format.

from scipy.sparse import coo_matrix
import numpy as np

M = np.matrix([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)

Here, S.row and S.col are the arrays of row and column indices, and S.data is the array of values at those indices. So you can filter with

idx = S.data >= 0.8

and, for example, create a new matrix using only those elements:

S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx])))
print(S2)

Output:

(0, 0)  1.0
(0, 2)  0.8
(1, 1)  1.0
(1, 2)  0.8
(2, 0)  0.8
(2, 1)  0.8
(2, 2)  1.0

Note that (0,1) does not appear, because its value is 0.7.
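The output above still contains the diagonal and both orientations of every pair, and it reports positions rather than sample names. A possible way to get back to named pairs is sketched below. The labels list is an assumption (the matrix in this answer carries no names), and the extra condition S.row < S.col keeps only the entries above the diagonal.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Assumed row/column names; in practice these would come from the file header.
labels = ["sample1", "sample2", "sample3"]
M = np.array([[1.0, 0.7, 0.8],
              [0.7, 1.0, 0.8],
              [0.8, 0.8, 1.0]])
S = coo_matrix(M)

# Keep values >= 0.8 that lie strictly above the diagonal.
keep = (S.data >= 0.8) & (S.row < S.col)
pairs = [(labels[r], labels[c], float(v))
         for r, c, v in zip(S.row[keep], S.col[keep], S.data[keep])]

for name_a, name_b, value in pairs:
    print(f"{name_a},{name_b} {value}")
```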

Answer 5

pandas' read_table can handle a regular expression in the sep parameter.

In [19]: !head file.txt
sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1

In [20]: df = pd.read_table('file.txt', sep=r'\s+')

In [21]: df
Out[21]:
    sample  sample1  sample2  sample3
0  sample1      1.0      0.7      0.8
1  sample2      0.7      1.0      0.8
2  sample3      0.8      0.8      1.0

From there, you can filter all values >= 0.8.

In [23]: df[df >= 0.8]
Out[23]:
    sample  sample1  sample2  sample3
0  sample1      1.0      NaN      0.8
1  sample2      NaN      1.0      0.8
2  sample3      0.8      0.8      1.0
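The filtered frame above still has the names in an ordinary column and keeps NaN placeholders, and recent pandas versions may refuse to compare the string column with a number at all. One way to continue from here, sketched under the assumption that the matrix is symmetric with unique labels, is to move the names into the index, stack, and keep a single orientation of each pair by comparing the two labels. read_csv with sep=r"\s+" is used on an in-memory string in place of read_table on a file.

```python
import io

import pandas as pd

# Stand-in for file.txt from the transcript above.
TEXT = """\
sample  sample1 sample2 sample3
sample1 1   0.7 0.8
sample2 0.7 1   0.8
sample3 0.8 0.8 1
"""

# Move the name column into the index so only numbers are compared.
df = pd.read_csv(io.StringIO(TEXT), sep=r"\s+").set_index("sample")

kept = df.where(df >= 0.8)      # entries below the threshold become NaN
pairs = kept.stack().dropna()   # long format: (row name, column name) -> value
# Keep one orientation per pair; this also drops the diagonal (row == col).
pairs = pairs[[row < col for row, col in pairs.index]]

for (row, col), value in pairs.items():
    print(f"{row},{col} {value}")
```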

Reading a matrix and fetching the row and column names in Python
