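I want to compute the correlation coefficient while removing one row at a time from a DataFrame. After getting all the correlation coefficients, I need to delete the row whose removal increases the correlation coefficient the most.
The following code shows my attempt: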
import pandas as pd
import numpy as np

#Access the data
file='tc_yolanda2.csv'
df = pd.read_csv(file)
x = df['dist']
y = df['mps']

#compute the correlation coefficient
def correlation_coefficient_4u(a,b):
    correl_mat = np.corrcoef(a,b)
    correlation = correl_mat[0,1]
    return correlation

c = correlation_coefficient_4u(x,y)
print('Correlation coeffcient is:',c)

#Let us try this one
lenght = len(df)
print(lenght)
a = 0
while lenght != 0:
    df.drop([a], inplace=True)
    c = correlation_coefficient_4u(df.dist,df.mps)
    a += 1
    print(round(c,4))
It successfully produced 50 correlation coefficients, but it also raised many warnings and an error, such as:
RuntimeWarning: Degrees of freedom <= 0 for slice
RuntimeWarning: divide by zero encountered in double_scalars
RuntimeWarning: invalid value encountered in multiply
RuntimeWarning: Mean of empty slice.
RuntimeWarning: invalid value encountered in true_divide
ValueError: labels [50] not contained in axis
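My next question is how to get rid of these errors, and how to find the index of the most negative correlation coefficient so that I can permanently delete that row and repeat the process above.
By the way, here is my data.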
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
dist 50 non-null float64
mps 50 non-null int64
dtypes: float64(1), int64(1)
memory usage: 880.0 bytes
None
Output:
dist mps
0 441.6 2
1 385.4 7
2 470.7 1
3 402.2 0
4 361.6 0
5 458.6 3
6 453.9 6
7 425.2 4
8 336.6 8
9 265.4 5
10 207.0 5
11 140.5 28
12 229.9 4
13 175.2 6
14 244.5 2
15 455.7 4
16 396.4 12
17 261.8 7
18 291.5 9
19 233.9 2
20 167.8 9
21 88.9 15
22 110.1 25
23 97.1 15
24 160.4 10
25 344.0 0
26 381.6 21
27 391.9 3
28 314.7 2
29 320.7 14
30 252.9 10
31 323.1 12
32 256.0 6
33 281.6 5
34 280.4 5
35 339.8 10
36 301.9 12
37 381.8 0
38 320.2 10
39 347.6 8
40 301.0 4
41 369.7 6
42 378.4 4
43 446.8 4
44 397.4 3
45 454.2 2
46 475.1 0
47 427.0 8
48 463.4 8
49 464.6 2
Correlation coeffcient is: -0.529328951782
49
-0.5209
-0.5227
-0.5091
-0.4998
-0.4975
-0.4879
-0.4903
-0.4838
-0.4845
-0.4908
-0.5085
-0.4541
-0.4736
-0.4962
-0.5273
-0.5189
-0.5452
-0.5494
-0.5485
-0.5882
-0.5999
-0.5711
-0.4321
-0.3251
-0.296
-0.3214
-0.4595
-0.4516
-0.5018
-0.5
-0.4524
-0.431
-0.4514
-0.4955
-0.5603
-0.5263
-0.385
-0.4764
-0.3229
-0.194
-0.3029
-0.1961
-0.2572
-0.2572
-0.6454
-0.7041
-0.5241
-1.0
Warning (from warnings module):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3159
c = cov(x, y, rowvar)
RuntimeWarning: Degrees of freedom <= 0 for slice
Warning (from warnings module):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
c *= 1. / np.float64(fact)
RuntimeWarning: divide by zero encountered in double_scalars
Warning (from warnings module):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
c *= 1. / np.float64(fact)
RuntimeWarning: invalid value encountered in multiply
nan
Warning (from warnings module):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 1110
avg = a.mean(axis)
RuntimeWarning: Mean of empty slice.
Warning (from warnings module):
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_methods.py", line 73
ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
nan
Traceback (most recent call last):
File "C:/Users/User/Desktop/CARDS 2017 Research Study/Python/methodology.py", line 28, in <module>
df.drop([a], inplace=True)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2530, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2562, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3744, in drop
labels[mask])
ValueError: labels [50] not contained in axis
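Answer 0 (score: 2)
You can use the following code to find and drop the row whose removal increases the strength of the correlation coefficient the most.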
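length=len(df)

def dropcc(df):
    df_temp=df.copy()
    idxmax=0
    c=0
    # Try leaving out each row in turn and remember the most negative result
    for i,v in df_temp.iterrows():
        df_temp.drop([i], inplace=True)
        c_temp = correlation_coefficient_4u(df_temp.dist,df_temp.mps)
        if c > c_temp:
            idxmax=i
            c=c_temp
        # Restore the full frame before testing the next row
        df_temp=df.copy()
        #print(round(c_temp,4))
    # Permanently drop the row that gave the most negative coefficient
    df.drop([idxmax], inplace=True)
    return df

# Repeat until the correlation is below -0.9 or the rows run out
for i in range(0, length-1):
    cc=correlation_coefficient_4u(df.dist,df.mps)
    if cc < -0.9:
        break
    else:
        df=dropcc(df)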
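As for the errors in the question: the `while lenght != 0` loop never changes `lenght`, so it keeps dropping rows cumulatively until the frame is empty and label 50 no longer exists. The RuntimeWarnings appear once fewer than two rows remain, because a correlation needs at least two points. Below is a minimal sketch of the same idea without in-place mutation. The function names `leave_one_out_corr` and `prune_until` and the `min_rows` guard are my own assumptions for illustration; the column names and the -0.9 threshold come from the code above.

```python
import pandas as pd


def leave_one_out_corr(df, x="dist", y="mps"):
    """For each row label, the Pearson correlation of x and y
    computed with that single row left out (hypothetical helper)."""
    return pd.Series(
        {label: df[x].drop(label).corr(df[y].drop(label)) for label in df.index}
    )


def prune_until(df, threshold=-0.9, min_rows=3, x="dist", y="mps"):
    """Repeatedly drop the row whose removal makes the correlation most
    negative, stopping once the correlation is below `threshold` or
    only `min_rows` rows remain. Works on a copy of `df`."""
    df = df.copy()
    while len(df) > min_rows and df[x].corr(df[y]) >= threshold:
        loo = leave_one_out_corr(df, x, y)
        # idxmin returns the label of the most negative coefficient
        df = df.drop(loo.idxmin())
    return df


# Tiny made-up example, not the data from the question
example = pd.DataFrame({"dist": [1, 2, 3, 4, 5], "mps": [5, 4, 3, 2, 10]})
pruned = prune_until(example)
print(list(pruned.index))
```

The `min_rows` guard is what keeps the loop away from the empty-slice warnings: every leave-one-out correlation is still computed on at least three rows. Looking rows up through `df.index` rather than a running counter also avoids the "labels not contained in axis" error after earlier rows have been dropped.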