Question

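I want to compute the correlation coefficient while leaving out one row of the DataFrame at a time. Once I have all of those correlation coefficients, I need to drop the row whose removal increases the correlation coefficient the most.

The following code shows my attempt: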

import pandas as pd
import numpy as np

#Access the data

file='tc_yolanda2.csv'
df = pd.read_csv(file)

x = df['dist']
y = df['mps']

#compute the correlation coefficient

def correlation_coefficient_4u(a,b):
    correl_mat = np.corrcoef(a,b)
    correlation = correl_mat[0,1]
    return correlation

c = correlation_coefficient_4u(x,y)
print('Correlation coeffcient is:',c)

#Let us try this one

lenght = len(df)
print(lenght)
a = 0
while lenght != 0:
    df.drop([a], inplace=True)
    c = correlation_coefficient_4u(df.dist,df.mps)
    a += 1
    print(round(c,4))

It does print 50 correlation coefficients, but it also produces many warnings and an error, such as:

RuntimeWarning: Degrees of freedom <= 0 for slice

RuntimeWarning: divide by zero encountered in double_scalars

RuntimeWarning: invalid value encountered in multiply

RuntimeWarning: Mean of empty slice.

RuntimeWarning: invalid value encountered in true_divide

ValueError: labels [50] not contained in axis

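My next question is how to get rid of these errors, and how to find the index of the most negative correlation coefficient, so that I can permanently drop that row and repeat the process above.

By the way, here is my data.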

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
dist    50 non-null float64
mps     50 non-null int64
dtypes: float64(1), int64(1)
memory usage: 880.0 bytes
None

Output:

     dist  mps
0   441.6    2
1   385.4    7
2   470.7    1
3   402.2    0
4   361.6    0
5   458.6    3
6   453.9    6
7   425.2    4
8   336.6    8
9   265.4    5
10  207.0    5
11  140.5   28
12  229.9    4
13  175.2    6
14  244.5    2
15  455.7    4
16  396.4   12
17  261.8    7
18  291.5    9
19  233.9    2
20  167.8    9
21   88.9   15
22  110.1   25
23   97.1   15
24  160.4   10
25  344.0    0
26  381.6   21
27  391.9    3
28  314.7    2
29  320.7   14
30  252.9   10
31  323.1   12
32  256.0    6
33  281.6    5
34  280.4    5
35  339.8   10
36  301.9   12
37  381.8    0
38  320.2   10
39  347.6    8
40  301.0    4
41  369.7    6
42  378.4    4
43  446.8    4
44  397.4    3
45  454.2    2
46  475.1    0
47  427.0    8
48  463.4    8
49  464.6    2
Correlation coeffcient is: -0.529328951782
49
-0.5209
-0.5227
-0.5091
-0.4998
-0.4975
-0.4879
-0.4903
-0.4838
-0.4845
-0.4908
-0.5085
-0.4541
-0.4736
-0.4962
-0.5273
-0.5189
-0.5452
-0.5494
-0.5485
-0.5882
-0.5999
-0.5711
-0.4321
-0.3251
-0.296
-0.3214
-0.4595
-0.4516
-0.5018
-0.5
-0.4524
-0.431
-0.4514
-0.4955
-0.5603
-0.5263
-0.385
-0.4764
-0.3229
-0.194
-0.3029
-0.1961
-0.2572
-0.2572
-0.6454
-0.7041
-0.5241
-1.0

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3159
    c = cov(x, y, rowvar)
RuntimeWarning: Degrees of freedom <= 0 for slice

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
    c *= 1. / np.float64(fact)
RuntimeWarning: divide by zero encountered in double_scalars

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
    c *= 1. / np.float64(fact)
RuntimeWarning: invalid value encountered in multiply
nan

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 1110
    avg = a.mean(axis)
RuntimeWarning: Mean of empty slice.

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_methods.py", line 73
    ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
nan
Traceback (most recent call last):
  File "C:/Users/User/Desktop/CARDS 2017 Research Study/Python/methodology.py", line 28, in <module>
    df.drop([a], inplace=True)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2530, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2562, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3744, in drop
    labels[mask])
ValueError: labels [50] not contained in axis

Answer 1

You can use the following code to find and drop the row whose removal increases the correlation coefficient the most.

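length = len(df)

def dropcc(df):
    # Leave out each row in turn and remember the one whose removal
    # gives the most negative correlation coefficient.
    idxmax = None
    c = 0

    for i in df.index:
        df_temp = df.drop([i])  # returns a new frame; df itself is unchanged
        c_temp = correlation_coefficient_4u(df_temp.dist, df_temp.mps)
        if c > c_temp:
            idxmax = i
            c = c_temp
        #print(round(c_temp,4))

    # Drop that row for good; if no removal gave a negative coefficient,
    # return the frame unchanged instead of dropping a missing label.
    if idxmax is not None:
        df = df.drop([idxmax])
    return df

# Keep dropping rows until the correlation falls below -0.9.
for i in range(0, length-1):
    cc = correlation_coefficient_4u(df.dist, df.mps)
    if cc < -0.9:
        break
    else:
        df = dropcc(df)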
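
As for the warnings in the question: `lenght` is never updated inside the `while` loop, and `df.drop([a], inplace=True)` removes rows cumulatively rather than one at a time. The frame therefore shrinks to one row and then to none (hence the `nan` values), after which label 50 is requested and `drop` raises the `ValueError`. A minimal sketch of a leave-one-out version that avoids this is below. The helper names `leave_one_out_corr` and `drop_until`, the `min_rows` guard, and the small `demo` frame are assumptions made for illustration, not part of the original code or data.

```python
import numpy as np
import pandas as pd

def leave_one_out_corr(df, x="dist", y="mps"):
    """Correlation of x and y with each row left out in turn, keyed by row label."""
    return pd.Series(
        {i: np.corrcoef(df[x].drop(i), df[y].drop(i))[0, 1] for i in df.index}
    )

def drop_until(df, threshold=-0.9, min_rows=3, x="dist", y="mps"):
    """Repeatedly drop the row whose removal gives the most negative correlation."""
    df = df.copy()  # leave the caller's frame untouched
    while len(df) > min_rows and np.corrcoef(df[x], df[y])[0, 1] >= threshold:
        loo = leave_one_out_corr(df, x, y)
        df = df.drop(loo.idxmin())  # idxmin returns the row label, not a position
    return df

# Made-up values standing in for tc_yolanda2.csv (illustrative only).
demo = pd.DataFrame({"dist": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     "mps": [10, 8, 9, 4, 20, 1]})

loo = leave_one_out_corr(demo)
print(loo.round(4))
print("row to drop:", loo.idxmin())

trimmed = drop_until(demo)
print(trimmed)
```

Because the result is a Series indexed by the original row labels, `idxmin()` answers the "which index" part directly, and the `min_rows` guard stops the loop before the correlation becomes undefined.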
