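I am trying to scale the data in a .csv table to the range 0 to 1. I keep getting an error saying that the input contains NaN, infinity, or a value that is too large:
"ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."
Until now I have always been able to track down the cause, for example an empty cell, a stray blank in the table, or a character that is not valid UTF-8, and get the script working.
This time I get the error again but cannot find its source. Is there a way to find out which data point is "NaN, infinity or a value too large"? I have too many data points to check by hand. Any suggestion is welcome, even if it is only a trick for finding the offending value in Excel. My code and the error are below. Unfortunately I cannot share the data set because it contains confidential information.
Code: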
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load training data set from CSV file
training_data_df = pd.read_csv("mtth_train.csv")
# Load testing data set from CSV file
test_data_df = pd.read_csv("mtth_test.csv")
# Data needs to be scaled to a small range like 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
# Scale both the training inputs and outputs
scaled_training = scaler.fit_transform(training_data_df)
scaled_testing = scaler.transform(test_data_df)
# Print out the adjustment that the scaler applied to the total_earnings column of data
print("Note: Parameters were scaled by multiplying by {:.10f} and adding {:.6f}".format(scaler.scale_[8], scaler.min_[8]))
# Create new pandas DataFrame objects from the scaled data
scaled_training_df = pd.DataFrame(scaled_training, columns=training_data_df.columns.values)
scaled_testing_df = pd.DataFrame(scaled_testing, columns=test_data_df.columns.values)
# Save scaled data dataframes to new CSV files
scaled_training_df.to_csv("mtth_train_scaled.csv", index=False)
scaled_testing_df.to_csv("mtth_test_scaled.csv", index=False)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-4e3503c96698> in <module>()
14 # Scale both the training inputs and outputs
15 scaled_training = scaler.fit_transform(training_data_df)
---> 16 scaled_testing = scaler.transform(test_data_df)
17
18 # Print out the adjustment that the scaler applied to the total_earnings column of data
~/anaconda3_501/lib/python3.6/site-packages/sklearn/preprocessing/data.py in transform(self, X)
365 check_is_fitted(self, 'scale_')
366
--> 367 X = check_array(X, copy=self.copy, dtype=FLOAT_DTYPES)
368
369 X *= self.scale_
~/anaconda3_501/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
451 % (array.ndim, estimator_name))
452 if force_all_finite:
--> 453 _assert_all_finite(array)
454
455 shape_repr = _shape_repr(array.shape)
~/anaconda3_501/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
42 and not np.isfinite(X).all()):
43 raise ValueError("Input contains NaN, infinity"
---> 44 " or a value too large for %r." % X.dtype)
45
46
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
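Answer 0 (score: 1)
import numpy as np
# df is the DataFrame to check, e.g. test_data_df from the question
# True for rows that contain no NaN, inf, or -inf
indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
df[indices_to_keep]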
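The mask above drops the bad rows but does not say which cells caused the error. A minimal sketch of one way to list them follows. The helper name `find_non_finite` and the `example` frame are made up for illustration, and the sketch assumes that non-numeric text (for example a stray string in a numeric column) should also count as a problem, which is why it coerces with `pd.to_numeric` first.

```python
import numpy as np
import pandas as pd

def find_non_finite(df):
    """Return one row per offending cell: row label, column name, raw value."""
    # Text that cannot be parsed as a number becomes NaN and is reported too.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    bad_mask = ~np.isfinite(numeric.to_numpy(dtype="float64"))
    rows, cols = np.nonzero(bad_mask)
    return pd.DataFrame({
        "row": df.index[rows].tolist(),
        "column": df.columns[cols].tolist(),
        "value": [df.iat[r, c] for r, c in zip(rows, cols)],
    })

# Hypothetical data standing in for the confidential CSV
example = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.inf, 2.0, "x"]})
report = find_non_finite(example)
print(report)
```

In the question's setting this would be called as `find_non_finite(test_data_df)`, since the traceback points at `scaler.transform(test_data_df)`.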
To count how many rows contain NA or inf values (they are tallied under the key False, clean rows under True):
from collections import Counter
Counter(indices_to_keep)
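You can also consult the pandas documentation on missing data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
According to the documentation, an option can be set so that inf values are treated as NA (note that this option is deprecated in pandas 2.1 and later, where replacing inf with NaN explicitly is recommended instead):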
import pandas as pd
pd.options.mode.use_inf_as_na = True
Then we only need to look for NA values.
import pandas as pd
pd.isna(df)
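Because the global option changes behavior for the whole session, and is deprecated in recent pandas releases, the same effect can probably be had by replacing inf with NaN on a copy before checking. The following is a sketch under that assumption; the small `df` here is invented sample data, not the asker's table.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real table
df = pd.DataFrame({
    "x": [1.0, np.inf, 3.0, 4.0],
    "y": [np.nan, 2.0, -np.inf, 5.0],
})

# Turn +inf / -inf into NaN on a copy, leaving df itself untouched
cleaned = df.replace([np.inf, -np.inf], np.nan)

# Boolean frame: True wherever a value is NaN (originally NaN or inf)
missing_mask = cleaned.isna()

# Show the original rows that contain at least one problem value
rows_with_problems = df[missing_mask.any(axis=1)]

print(missing_mask.sum())
print(rows_with_problems)
```

Printing rows from the original `df` rather than from `cleaned` keeps the distinction between NaN and inf visible in the output.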
Answer 1 (score: 0)
Use
df.isnull().sum()
to get the total number of missing values in each column.
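One caveat: `isnull()` counts NaN but not inf, so a column of infinities would still show zero. A small sketch of a per-column report that counts both is given below; the `df` and the `report` layout are illustrative assumptions, and only numeric columns are checked for inf.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real table
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.inf, 2.0, -np.inf]})

# NaN count per column, as in the answer above
nan_per_column = df.isnull().sum()

# inf count per column; np.isinf only makes sense for numeric columns
inf_per_column = np.isinf(df.select_dtypes(include="number")).sum()

# Non-numeric columns have no inf count, so fill those gaps with 0
report = (
    pd.DataFrame({"nan": nan_per_column, "inf": inf_per_column})
    .fillna(0)
    .astype(int)
)

# Show only the columns that contain at least one problem value
print(report[(report["nan"] > 0) | (report["inf"] > 0)])
```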