I have a code snippet from a program named Flights.py:
#Load the Dataset
df = dataset
df.isnull().any()
df = df.fillna(lambda x: x.median())
# Define X and Y
X = df.iloc[:, 2:124].values
y = df.iloc[:, 136].values
X_tolist = X.tolist()
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
The second-to-last line throws the following error:
Traceback (most recent call last):
File "<ipython-input-14-d4add2ccf5ab>", line 3, in <module>
X_train = sc.fit_transform(X_train)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/base.py", line 494, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 560, in fit
return self.partial_fit(X, y)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 583, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number, not 'function'
My DataFrame df has shape (22587, 138).
I have been looking at the following question for inspiration:
TypeError: float() argument must be a string or a number, not 'method' in Geocoder
I tried the following adjustment:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train.as_matrix)
X_test = sc.transform(X_test.as_matrix)
which resulted in the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'as_matrix'
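I am currently at a loss as to how to scan through the DataFrame and find/convert the offending entries.
Answer 0 (score: 2)
As this answer explains, fillna is not designed to work with a callback. If you pass one, it is treated as a literal fill value, which means your NaNs will be replaced with lambdas: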
df
col1 col2 col3 col4
row1 65.0 24 47.0 NaN
row2 33.0 48 NaN 89.0
row3 NaN 34 67.0 NaN
row4 24.0 12 52.0 17.0
df.fillna(lambda x: x.median())
col1 col2 \
row1 65 24
row2 33 48
row3 <function <lambda> at 0x10bc47730> 34
row4 24 12
col3 col4
row1 47 <function <lambda> at 0x10bc47730>
row2 <function <lambda> at 0x10bc47730> 89
row3 67 <function <lambda> at 0x10bc47730>
row4 52 17
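If you are trying to fill with the median, the solution is to compute the per-column medians (df.median() returns a Series indexed by column name) and pass that to fillna.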
df
col1 col2 col3 col4
row1 65.0 24 47.0 NaN
row2 33.0 48 NaN 89.0
row3 NaN 34 67.0 NaN
row4 24.0 12 52.0 17.0
df = df.fillna(df.median())
df
col1 col2 col3 col4
row1 65.0 24 47.0 53.0
row2 33.0 48 52.0 89.0
row3 33.0 34 67.0 53.0
row4 24.0 12 52.0 17.0
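The median fill above can be reproduced as a small self-contained sketch. The frame is rebuilt from the values printed in this answer; the variable names medians and filled are illustrative only, and numeric_only=True is an added precaution for frames that also hold non-numeric columns, not something the answer itself uses.

```python
import numpy as np
import pandas as pd

# Rebuild the small example frame printed above.
df = pd.DataFrame(
    {
        "col1": [65.0, 33.0, np.nan, 24.0],
        "col2": [24, 48, 34, 12],
        "col3": [47.0, np.nan, 67.0, 52.0],
        "col4": [np.nan, 89.0, np.nan, 17.0],
    },
    index=["row1", "row2", "row3", "row4"],
)

# df.median() returns a Series indexed by column name.
# numeric_only=True is an assumption added here to skip non-numeric columns.
medians = df.median(numeric_only=True)

# fillna aligns the Series with the columns and returns a new frame,
# so the result has to be assigned.
filled = df.fillna(medians)
print(filled)
```

Because fillna does not modify the frame in place by default, forgetting the assignment leaves the NaNs untouched.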
Answer 1 (score: 0)
df = df.fillna(lambda x: x.median())
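This is not a valid way to use fillna. It expects a literal value here, or a mapping from columns to literal values. It does not call the function you supply; instead, the NA cells are simply set to the function itself. That function is what your estimator (here, StandardScaler) is then trying to convert to a float.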
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
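As a minimal sketch of the two forms described here, the snippet below fills once with a scalar and once with a column-to-value mapping. The frame and the names by_scalar and by_mapping are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 7.0]})

# Form 1: a single literal value used for every NaN.
by_scalar = df.fillna(0.0)

# Form 2: a mapping from column name to the literal value for that column.
by_mapping = df.fillna({"a": df["a"].median(), "b": -1.0})

print(by_scalar)
print(by_mapping)
```

In both forms the fill values are computed before fillna is called, which is the difference from passing a lambda.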
Answer 2 (score: 0)
I ran into the same trouble using df = df.fillna(lambda x: x.median()).
Here is my solution for getting real values, rather than "function" objects, into the DataFrame:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
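I create a DataFrame with 10 rows and 3 columns, then insert some NaN values:
# astype(float) makes the columns float up front so they can hold NaN
df = pd.DataFrame(np.random.randint(100, size=(10, 3))).astype(float)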
df.iloc[3:5,0] = np.nan
df.iloc[4:6,1] = np.nan
df.iloc[5:8,2] = np.nan
Assign silly column labels for convenience later:
df.columns=['Number_of_Holy_Hand_Grenades_of_Antioch', 'Number_of_knight_fleeings', 'Number_of_rabbits_of_Caerbannog']
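print(df.isnull().any())  # tells whether each column contains NaN
For each column, addressed by its label, we fill all NaN values with the median computed on that column itself. The same approach works with mean(), etc.
for i in df.columns:  # use df.columns[w:] if the first w columns are row descriptions
    df[i] = df[i].fillna(df[i].median())
print(df.isnull().any())
Now df contains the medians in place of the NaNs:
print(df)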
You can then do, for example:
X = df.iloc[:, :].values  # .ix has been removed from pandas; .iloc selects the same data here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
This does not work with df = df.fillna(lambda x: x.median()).
We can now pass df on to downstream methods because all values are real numbers rather than functions, in contrast to any approach that combines dataframe.fillna() with a lambda.
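If a frame has already been filled with a function by mistake, one possible way to find and repair the affected cells is sketched below. This is not taken from any of the answers: the toy frame, the placeholder function, and the names is_callable and repaired are all assumptions for illustration. The cells holding a callable are located, blanked back to NaN, and the columns are converted to numeric before the median fill.

```python
import pandas as pd

# Stands in for the lambda that ended up inside the cells (hypothetical).
placeholder = lambda x: x.median()

# Toy frame: column "a" holds a function where a NaN used to be.
df = pd.DataFrame({"a": [1.0, placeholder, 3.0], "b": [4.0, 5.0, 6.0]})

# Boolean frame marking every cell whose value is callable.
is_callable = df.apply(lambda col: col.map(callable))
print(is_callable.any())  # per column: does it contain a function?

# Blank the offending cells back to NaN and restore numeric dtypes.
repaired = df.mask(is_callable).apply(pd.to_numeric)

# Fill with the per-column medians computed from the cleaned data.
repaired = repaired.fillna(repaired.median())
print(repaired.dtypes)
```

Once every column reports a numeric dtype, the values can be handed to StandardScaler without triggering the float() conversion error from the question.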