Question

I want to compute the correlation between two rows of a pandas DataFrame. When every entry is numeric, correlating two rows is easy:

import pandas as pd
import numpy as np
example_df = pd.DataFrame(np.random.randn(10, 30), np.arange(10))
example_df.iloc[1, :].corr(example_df.iloc[2, :])

But if the DataFrame has mixed types, computing the correlation fails, even when you select only the numeric entries:

example_df['Letter'] = 'A'
example_df.iloc[1, :-1].corr(example_df.iloc[2, :-1])

AttributeError: 'numpy.float64' object has no attribute 'sqrt'

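A row taken from a mixed-type DataFrame comes back as a Series with object dtype, and it stays object even after the string entry is sliced off. The Pearson correlation needs a square root, which is not available for object dtype, so the correlation cannot be computed. You have to cast the rows to float manually before computing it: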

example_df.iloc[1, :-1].astype('float64').corr(example_df.iloc[2, :-1].astype('float64'))
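To make the cause visible, here is a minimal sketch that inspects the dtype of the row before and after conversion. It rebuilds the example frame with a seeded generator (the seed and the variable names `row` and `numeric_row` are my own, for illustration only) and uses `pd.to_numeric` as one alternative to `astype('float64')`.

```python
import numpy as np
import pandas as pd

# Rebuild the mixed-type frame from the question (seed chosen arbitrarily).
rng = np.random.default_rng(0)
example_df = pd.DataFrame(rng.standard_normal((10, 30)), index=np.arange(10))
example_df['Letter'] = 'A'

# The row is extracted across float and string columns, so it is object dtype,
# and slicing off the last entry does not change that.
row = example_df.iloc[1, :-1]
print(row.dtype)  # object

# Converting the values restores a float dtype that corr() can work with.
numeric_row = pd.to_numeric(row)
print(numeric_row.dtype)  # float64

print(numeric_row.corr(pd.to_numeric(example_df.iloc[2, :-1])))
```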

Is there a better way?

Answer 1

I don't know whether these are any better than what you did, but here is a NumPy way:

np.corrcoef(example_df.iloc[1:3, :-1])

array([[ 1.        , -0.37194563],
       [-0.37194563,  1.        ]])

And here is a pandas way:

example_df.iloc[1:3, :-1].T.corr()

          1         2
1  1.000000 -0.371946
2 -0.371946  1.000000

If you want to compare non-consecutive rows, adjust iloc like this:

example_df.iloc[[1, 4], :-1].T.corr()
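The approaches above work because slicing the columns of the DataFrame first keeps each remaining column as float, so the rows never pass through object dtype. As a sketch of the same idea without relying on the string column being last, the columns could be filtered with `select_dtypes` instead of `:-1`. The helper name `row_corr` and the seeded data are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Mixed-type frame as in the question (seed chosen arbitrarily).
rng = np.random.default_rng(0)
example_df = pd.DataFrame(rng.standard_normal((10, 30)), index=np.arange(10))
example_df['Letter'] = 'A'


def row_corr(df, i, j):
    """Pearson correlation between rows i and j (by position), numeric columns only."""
    # Dropping non-numeric columns first means each row Series is float dtype.
    numeric = df.select_dtypes(include='number')
    return numeric.iloc[i].corr(numeric.iloc[j])


print(row_corr(example_df, 1, 4))
```

The value should agree with the off-diagonal entry of `example_df.iloc[[1, 4], :-1].T.corr()`.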

Answer 2

You can hide the non-float column in the index:

example_df = example_df.set_index(['Letter'], append=True)

That way the rows are purely float dtype again. Then

example_df.iloc[1, :].corr(example_df.iloc[2, :])

works as before.
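A short sketch of the full round trip, assuming you want the `Letter` column back afterwards. The names `indexed` and `restored` and the seeded data are my own for illustration; `reset_index(level='Letter')` is one way to undo the `set_index` call, and it puts the column at the front rather than at the end.

```python
import numpy as np
import pandas as pd

# Mixed-type frame as in the question (seed chosen arbitrarily).
rng = np.random.default_rng(0)
example_df = pd.DataFrame(rng.standard_normal((10, 30)), index=np.arange(10))
example_df['Letter'] = 'A'

# Move the string column into the index; only float columns remain.
indexed = example_df.set_index(['Letter'], append=True)
print(indexed.iloc[1, :].dtype)  # float64

# Row correlation now works without any casting.
print(indexed.iloc[1, :].corr(indexed.iloc[2, :]))

# Bring 'Letter' back as an ordinary column when it is needed again.
restored = indexed.reset_index(level='Letter')
```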

Proper way to do row correlations in a pandas DataFrame
