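I have several large datasets and need to find the correlations between them. The data is converted to pandas DataFrames, and I use pd.DataFrame.corr() to compute the correlations. It works for some datasets but not for others, and I am not sure why.
The values in the datasets that fail are not all identical, so the standard deviation is not 0. The column types (dtype) of the DataFrame objects are all float64.
The code works for: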
BPM1401-01:x BPM1401-01:y
2019-07-23 05:59:59.641471863 0.000052 -0.000108
2019-07-23 06:00:00.033471822 0.000050 -0.000108
2019-07-23 06:00:00.425471783 NaN -0.000108
2019-07-23 06:00:00.816471815 0.000051 NaN
2019-07-23 06:00:01.170471907 0.000050 NaN
2019-07-23 06:00:01.954471827 0.000049 NaN
2019-07-23 06:00:02.345471859 0.000051 NaN
2019-07-23 06:00:02.737471819 0.000051 -0.000108
2019-07-23 06:00:03.090471745 0.000052 -0.000108
2019-07-23 06:00:03.481471777 0.000051 -0.000109
But not for:
SR1:BPMXRMSGlobal SR1:BPMYRMSGlobal
2019-07-23 05:59:58.197318077 1.096721 NaN
2019-07-23 05:59:58.197477102 NaN 1.586067
2019-07-23 06:00:01.471035957 NaN 0.772168
2019-07-23 06:00:02.132909060 1.553643 NaN
2019-07-23 06:00:02.132987022 NaN 1.209081
2019-07-23 06:00:02.793922901 2.558707 NaN
2019-07-23 06:00:02.793971062 NaN 1.624215
2019-07-23 06:00:03.440277100 2.508732 NaN
2019-07-23 06:00:03.440378904 NaN 1.540483
2019-07-23 06:00:04.094022036 2.325517 NaN
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb
import numpy as np

# Align the data using the timestamps (already done for the sets above).
def align_dataframes(data_frame_list):
    # Start from the first dataframe
    curr_df = data_frame_list[0]
    # Outer-join each remaining dataframe on the timestamp index
    for next_df in data_frame_list[1:]:
        curr_df = curr_df.join(next_df, how='outer')
    # Return the aligned dataframe
    return curr_df

def plot_corr(data_frame):
    print(data_frame.dtypes)  # every column is float64
    # Compute the correlation matrix
    corr_mat = data_frame.corr(method='pearson', min_periods=1)
    heat_map = sb.heatmap(corr_mat, linewidths=.5)
    plt.show()
It seems to me that the second DataFrame should work as well, but the corr() matrix ends up returning NaN values.
Answer 0 (score: 0)
The second dataframe has no rows in which both values are non-null, so there are no data points on which to compute a correlation.
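The point above can be checked directly. The sketch below rebuilds the first six rows of the second dataframe by hand, counts the rows where both columns hold a value, and shows that the off-diagonal correlation comes out as NaN. The last part is an assumption that is not in the original post: if the two signals are sampled at slightly different times, one possible workaround is to interpolate each column along the time index before calling corr(). Whether interpolation is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# First six rows of the second dataframe from the question, rebuilt by hand.
index = pd.to_datetime([
    "2019-07-23 05:59:58.197318077",
    "2019-07-23 05:59:58.197477102",
    "2019-07-23 06:00:01.471035957",
    "2019-07-23 06:00:02.132909060",
    "2019-07-23 06:00:02.132987022",
    "2019-07-23 06:00:02.793922901",
])
df = pd.DataFrame(
    {
        "SR1:BPMXRMSGlobal": [1.096721, np.nan, np.nan, 1.553643, np.nan, 2.558707],
        "SR1:BPMYRMSGlobal": [np.nan, 1.586067, 0.772168, np.nan, 1.209081, np.nan],
    },
    index=index,
)

# Count the rows in which both columns hold a value.
complete_rows = int(df.notna().all(axis=1).sum())
print("rows with both values present:", complete_rows)

# corr() only uses rows where both columns are non-null,
# so with no such rows the off-diagonal entries are NaN.
corr_mat = df.corr(method="pearson", min_periods=1)
print(corr_mat)

# Assumed workaround (not from the original answer): interpolate each column
# along the DatetimeIndex, then drop rows that are still incomplete.
aligned = df.interpolate(method="time").dropna()
aligned_corr = aligned.corr(method="pearson")
print(aligned_corr)
```

Other ways to line up the two series, such as resampling both onto a common time grid, would serve the same purpose. The key requirement is that corr() sees at least some rows with both columns populated.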