我想为我读过的pandas数据框运行以下模型(逻辑回归)。
但是,当预测方法出现时,它会说:“输入包含NaN,无穷大或者对于dtype来说太大('float64')”
我的代码是:(注意,必须存在10个数字和4个分类变量)
import pandas as pd
import io
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
x = pd.to_numeric(heart['diagnosis'])
heart['diagnosis'] = (x > 1).astype(int)
heart_train, heart_test, goal_train, goal_test = train_test_split(heart.loc[:,'age':'thal'], heart.loc[:,'diagnosis'], test_size=0.3, random_state=0)
clf = LogisticRegression()
clf.fit(heart_train, goal_train)
heart_test_results = clf.predict(heart_test) #From here it is broken
print(clf.get_params(clf))
print(clf.score(heart_train,goal_train))
数据框信息如下(print(heart.info()):
RangeIndex: 271 entries, 0 to 270
Data columns (total 14 columns):
age 270 non-null object
sex 270 non-null object
chestpain 270 non-null category
restBP 270 non-null object
chol 270 non-null object
sugar 270 non-null object
ecg 270 non-null category
maxhr 270 non-null object
angina 270 non-null object
dep 270 non-null object
exercise 270 non-null category
fluor 270 non-null object
thal 270 non-null category
diagnosis 271 non-null int32
dtypes: category(4), int32(1), object(9)
memory usage: 21.4+ KB
None
有谁知道我在这里缺少什么?
提前致谢!!
答案 0 :(得分:0)
我猜这个错误的原因是你如何解析这些数据:
In [116]: %paste
s = requests.get(url).content
s = s.decode('utf-8')
s_rows = s.split('\n')
s_rows_cols = [each.split() for each in s_rows]
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
heart = pd.DataFrame(s_rows_cols, columns = header_row, index=range(271))
pd.to_numeric(heart['age'])
pd.to_numeric(heart['restBP'])
pd.to_numeric(heart['chol'])
pd.to_numeric(heart['sugar'])
pd.to_numeric(heart['maxhr'])
pd.to_numeric(heart['angina'])
pd.to_numeric(heart['dep'])
pd.to_numeric(heart['fluor'])
heart['chestpain'] = heart['chestpain'].astype('category')
heart['ecg'] = heart['ecg'].astype('category')
heart['thal'] = heart['thal'].astype('category')
heart['exercise'] = heart['exercise'].astype('category')
## -- End pasted text --
In [117]: heart
Out[117]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
270 None None NaN None None None NaN None None None NaN None NaN None
[271 rows x 14 columns]
注意:注意NaN的最后一行
尝试以这种简化的方式来做:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
header_row = ['age','sex','chestpain','restBP','chol','sugar','ecg','maxhr','angina','dep','exercise','fluor','thal','diagnosis']
In [118]: df = pd.read_csv(url, sep='\s+', header=None, names=header_row)
In [119]: df
Out[119]:
age sex chestpain restBP chol sugar ecg maxhr angina dep exercise fluor thal diagnosis
0 70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2
1 67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1
2 57.0 1.0 2.0 124.0 261.0 0.0 0.0 141.0 0.0 0.3 1.0 0.0 7.0 2
3 64.0 1.0 4.0 128.0 263.0 0.0 0.0 105.0 1.0 0.2 2.0 1.0 7.0 1
4 74.0 0.0 2.0 120.0 269.0 0.0 2.0 121.0 1.0 0.2 1.0 1.0 3.0 1
5 65.0 1.0 4.0 120.0 177.0 0.0 0.0 140.0 0.0 0.4 1.0 0.0 7.0 1
6 56.0 1.0 3.0 130.0 256.0 1.0 2.0 142.0 1.0 0.6 2.0 1.0 6.0 2
7 59.0 1.0 4.0 110.0 239.0 0.0 2.0 142.0 1.0 1.2 2.0 1.0 7.0 2
8 60.0 1.0 4.0 140.0 293.0 0.0 2.0 170.0 0.0 1.2 2.0 2.0 7.0 2
9 63.0 0.0 4.0 150.0 407.0 0.0 2.0 154.0 0.0 4.0 2.0 3.0 7.0 2
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
260 58.0 0.0 3.0 120.0 340.0 0.0 0.0 172.0 0.0 0.0 1.0 0.0 3.0 1
261 60.0 1.0 4.0 130.0 206.0 0.0 2.0 132.0 1.0 2.4 2.0 2.0 7.0 2
262 58.0 1.0 2.0 120.0 284.0 0.0 2.0 160.0 0.0 1.8 2.0 0.0 3.0 2
263 49.0 1.0 2.0 130.0 266.0 0.0 0.0 171.0 0.0 0.6 1.0 0.0 3.0 1
264 48.0 1.0 2.0 110.0 229.0 0.0 0.0 168.0 0.0 1.0 3.0 0.0 7.0 2
265 52.0 1.0 3.0 172.0 199.0 1.0 0.0 162.0 0.0 0.5 1.0 0.0 7.0 1
266 44.0 1.0 2.0 120.0 263.0 0.0 0.0 173.0 0.0 0.0 1.0 0.0 7.0 1
267 56.0 0.0 2.0 140.0 294.0 0.0 2.0 153.0 0.0 1.3 2.0 0.0 3.0 1
268 57.0 1.0 4.0 140.0 192.0 0.0 0.0 148.0 0.0 0.4 2.0 0.0 6.0 1
269 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 2
[270 rows x 14 columns]
还要注意自动解析(猜到)的dtypes - pd.read_csv()
会为你做所有必要的转换:
In [120]: df.dtypes
Out[120]:
age float64
sex float64
chestpain float64
restBP float64
chol float64
sugar float64
ecg float64
maxhr float64
angina float64
dep float64
exercise float64
fluor float64
thal float64
diagnosis int64
dtype: object
答案 1 :(得分:0)
我怀疑这是train_test_split的事情。
我建议您将X和y转换为numpy数组,以避免出现此问题。这通常可以解决此问题。
X = heart.loc[:,'age':'thal'].as_matrix()
y = heart.loc[:,'diagnosis'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size, random_state)
然后适合 clf.fit(X_train,y_train)