Question

I have a pandas dataframe that looks something like this:

             Al01   BBR60   CA07    NL219
AAEAMEVAT    MP      NaN     MP      MP 
AAFEDLRLL    NaN     NaN     NaN     NaN
AAGAAVKGV    NP      NaN     NP      NP 
ADRGLLRDI    NaN     NP      NaN     NaN 
AEIMKICST    PB1     NaN     NaN     PB1 
AFDERRAGK    NaN     NaN     NP      NP 
AFDERRAGK    NP      NaN     NaN     NaN

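There are about a thousand rows and six columns. Most cells are empty (NaN). I'd like to know the probability of text appearing in each column, given that a different column has text in it. For example, the small snippet here would yield something like this: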

            Al01    BBR60   CA07    NL219
Al01        4       0       2       3
BBR60       0       1       0       0
CA07        2       0       3       3
NL219       3       0       3       4

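This indicates that there are 4 hits in the Al01 column; of those 4 hits, none are also hits in the BBR60 column, 2 are also hits in the CA07 column, and 3 are also hits in the NL219 column. And so on.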

I can step through each column and build a dict with the values, but that seems clumsy. Is there a simpler approach?

Answer 1

The operation you are performing can be expressed with np.einsum, since it is an inner product between each pair of columns:

import numpy as np
import pandas as pd

df = pd.read_table('data', sep=r'\s+')  # raw string avoids an invalid-escape warning
print(df)
#   Al01 BBR60 CA07 NL219
# 0   MP   NaN   MP    MP
# 1  NaN   NaN  NaN   NaN
# 2   NP   NaN   NP    NP
# 3  NaN    NP  NaN   NaN
# 4  PB1   NaN  NaN   PB1
# 5  NaN   NaN   NP    NP
# 6   NP   NaN  NaN   NaN

arr = (~df.isnull()).values.astype('int')
print(arr)
# [[1 0 1 1]
#  [0 0 0 0]
#  [1 0 1 1]
#  [0 1 0 0]
#  [1 0 0 1]
#  [0 0 1 1]
#  [1 0 0 0]]

result = pd.DataFrame(np.einsum('ij,ik', arr, arr),
                      columns=df.columns, index=df.columns)
print(result)

yields

       Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4
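
As a rough, self-contained sketch of why this works: the subscripts 'ij,ik' sum over the shared row index i, which is the same as multiplying the transposed 0/1 array by itself. The sample values below are copied from the question; the row labels are left out, and the names via_einsum and via_matmul are only illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample values from the question; row labels are omitted here because
# they do not affect the counts.
df = pd.DataFrame({
    "Al01":  ["MP", nan, "NP", nan, "PB1", nan, "NP"],
    "BBR60": [nan, nan, nan, "NP", nan, nan, nan],
    "CA07":  ["MP", nan, "NP", nan, nan, "NP", nan],
    "NL219": ["MP", nan, "NP", nan, "PB1", "NP", nan],
})

# 1 where a cell holds text, 0 where it is NaN
arr = df.notna().to_numpy().astype(int)

# 'ij,ik' sums over the row index i, leaving a (column j, column k) table
via_einsum = np.einsum("ij,ik", arr, arr)
# The same computation written as an ordinary matrix product
via_matmul = arr.T @ arr

assert np.array_equal(via_einsum, via_matmul)
print(pd.DataFrame(via_einsum, index=df.columns, columns=df.columns))
```

Each diagonal entry is the number of non-empty cells in that column, and each off-diagonal entry counts the rows where both columns are non-empty.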

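In general, when a computation boils down to numerical operations that do not depend on the index, NumPy is faster than Pandas. That appears to be the case here: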

In [130]: %timeit df2 = df.applymap(lambda x: int(not pd.isnull(x))); df2.T.dot(df2)
1000 loops, best of 3: 1.12 ms per loop

In [132]: %timeit arr = (~df.isnull()).values.astype('int'); pd.DataFrame(np.einsum('ij,ik', arr, arr), columns=df.columns, index=df.columns)
10000 loops, best of 3: 132 µs per loop

Answer 2

It's just matrix multiplication:

import pandas as pd
df = pd.read_csv('data.csv',index_col=0, delim_whitespace=True)
df2 = df.applymap(lambda x: int(not pd.isnull(x)))
print(df2.T.dot(df2))

Output:

           Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4

[4 rows x 4 columns]
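
The question asks for probabilities rather than raw counts. One possible way to get there (a sketch that goes beyond either answer) is to divide each row of the count matrix by its diagonal entry, which gives the fraction of hits in the row's column that are also hits in each other column. The names mask, counts, hits and cond_prob are illustrative, the row labels are omitted, and df.notna() stands in for applymap, which recent pandas releases deprecate.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample values from the question (row labels omitted).
df = pd.DataFrame({
    "Al01":  ["MP", nan, "NP", nan, "PB1", nan, "NP"],
    "BBR60": [nan, nan, nan, "NP", nan, nan, nan],
    "CA07":  ["MP", nan, "NP", nan, nan, "NP", nan],
    "NL219": ["MP", nan, "NP", nan, "PB1", "NP", nan],
})

# 0/1 indicator frame, built without applymap
mask = df.notna().astype(int)

# Pairwise co-occurrence counts, as in the answers above
counts = mask.T.dot(mask)

# The diagonal holds the number of hits in each column
hits = pd.Series(np.diag(counts.to_numpy()), index=counts.index)

# Row i, column j: share of column i's hits that are also hits in column j.
# Columns with no hits at all become NaN instead of dividing by zero.
cond_prob = counts.div(hits.replace(0, np.nan), axis=0)
print(cond_prob)
```

For the sample data, the Al01 row gives 0.5 for CA07, because 2 of the 4 Al01 hits are also CA07 hits.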

Pairwise matrix from a pandas dataframe

2 answers: