Scaling samples

Time: 2017-06-12 03:25:53

Tags: python pandas numpy

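I am taking the Udemy course "Deep Learning A-Z™: Hands-On Artificial Neural Networks". In the first assignment (building an artificial neural network), I run the following code: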

# Import the libraries
# Open-source numerical computation libraries (GPU-enabled)
# import Theano as th
# import Tensorflow as tf
# import Keras as ke

# Part 1 Data preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Get the data set
dataset = pd.read_csv('Churn_Modelling.csv')
# Pull test sample
testset = pd.read_csv('Churn_Modelling_Test.csv')

# Dividing the dependent and independent variable
# -1 expresses the last column (feature)
x = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

# Get the independent variables
testx = testset.iloc[:, 3:13].values


# Pre processing
# Concatenate the test rows
x = np.vstack((x,testx))

# Taking care of categorical variables
# Creating encoding for categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x_country = LabelEncoder()
x[:,1] = labelencoder_x_country.fit_transform(x[:,1])

labelencoder_x_gender = LabelEncoder()
x[:,2] = labelencoder_x_gender.fit_transform(x[:,2])


# Block the ANN trying to assert true value of encoding
# Create dummy variables for the non-dichotomic encoded variable
onehotencoder = OneHotEncoder(categorical_features = [1])
x = onehotencoder.fit_transform(x).toarray()

# Drop one dummy column to avoid the dummy variable trap
x = x[:,1:]

prepro_testx = np.array([x[1000,:]])

x = np.delete(x, 10000, 0)

# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, 
                                                    random_state = 0)

# Scaling the features (columns) to standardize or normalize values
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

# Scale it 
prepro_testx = sc_x.transform(prepro_testx)

# Part 2 Making the ANN

# Importing the libraries
import keras as ke
# Start the ANN
from keras.models import Sequential
# Make the layers
from keras.layers import Dense

# Initialize the ANN you can define a graph or a sequence of layers
# Create classifier
# Classifier is the model of the neural network
classifier = Sequential()

# Construct the layers with the input layer and first hidden layer
# The number of nodes will be the number of independent variables
# For the hidden layers, we will use the rectifier function
# For the output layer, we will use the sigmoid function
# The output is 0 or 1 so only one node
aux = ((x[1].size + 1)/2)
# Api 2
classifier.add(Dense(activation = 'relu', input_dim = x[1].size, units = int(aux),
                     kernel_initializer='uniform'))
# Api 1
#classifier.add(Dense(output_dim = int(aux), init = 'uniform', 
#                     activation = 'relu', input_dim = x[1].size))

# Add the second hidden layer
classifier.add(Dense(activation = 'relu', units = int(aux),
                     kernel_initializer='uniform'))

# Add the output layer
classifier.add(Dense(activation = 'sigmoid', units = 1,
                     kernel_initializer='uniform'))

# Compiling the ANN (stochastic gradient descent with the Adam optimizer)
# A sum-of-squares loss, SUM(y - y')^2, is not used in this case:
# the output activation is a sigmoid, so we use the log loss instead
# For two output categories use binary_crossentropy, else categorical_...
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy',
                   metrics = ['accuracy'])

# Fitting the classifier (ANN) to the Training Set
classifier.fit(x_train, y_train, batch_size = 10, epochs = 100)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)

new_pred = classifier.predict(prepro_testx)
new_pred = (new_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

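For the prediction sample, this runs and returns False. In the tutorial, however, they do not pull the sample from an extra CSV; they simply pass in hard-coded values: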

new_pred = classifier.predict(sc_x.transform(np.array([[1, 0, 416, 0, 41, 10, 122189.66,2,1,0,98301.61]])))

new_pred = (new_pred > .50)
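The hard-coded array has to follow the column layout that x has after encoding. The sketch below rebuilds that layout by hand, assuming the ten selected columns are credit score, country, gender, age, tenure, balance, number of products, has-card flag, active flag and salary, and that the labels are France/Germany/Spain and Female/Male (none of this is stated in the question). The helper `encode_row` and the raw values passed to it are hypothetical, chosen so the result matches the array above.

```python
import numpy as np

# Assumption: LabelEncoder assigns integer codes in sorted label order,
# so these (assumed) labels would map as follows.
COUNTRY_CODES = {"France": 0, "Germany": 1, "Spain": 2}
GENDER_CODES = {"Female": 0, "Male": 1}


def encode_row(credit_score, country, gender, age, tenure, balance,
               n_products, has_card, is_active, salary):
    """Hypothetical helper: build one row in the layout x has after x = x[:, 1:]."""
    one_hot = np.zeros(len(COUNTRY_CODES))
    one_hot[COUNTRY_CODES[country]] = 1.0
    # The one-hot columns come first; dropping column 0 removes the France dummy.
    dummies = one_hot[1:]
    rest = [credit_score, GENDER_CODES[gender], age, tenure, balance,
            n_products, has_card, is_active, salary]
    return np.concatenate([dummies, rest]).reshape(1, -1)


row = encode_row(416, "Germany", "Female", 41, 10, 122189.66, 2, 1, 0, 98301.61)
print(row.tolist())
```

If a hard-coded row and a row taken from the encoded array disagree in any position, their scaled values will disagree in that position too.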

Both versions run and return False, but the scaled values from sc_x.transform(np.array([[1, 0, 416, 0, 41, 10, 122189.66,2,1,0,98301.61]])) differ from the scaled values in prepro_testx.

Why is that?

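What I do differently:

1. Add a new CSV, in case I want to add several extra samples rather than just one.
2. Append the new CSV to the main sample.
3. Encode/scale all the data so that it is handled the same way.
4. Retrieve the appended extra samples before splitting.
5. Classify.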
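The steps above can be sketched with toy data as follows. This is a minimal illustration, not the course's method: the arrays and their values are made up, and the standardization is written out by hand with NumPy (mean and population standard deviation, which is what StandardScaler computes) and fitted on all original rows for brevity, whereas the code above fits on x_train only.

```python
import numpy as np

# Toy stand-ins for the encoded main rows and the extra rows (made-up numbers).
train_rows = np.array([[600.0, 40.0], [700.0, 35.0], [650.0, 50.0]])
extra_rows = np.array([[416.0, 41.0], [580.0, 29.0]])

n_train = train_rows.shape[0]
stacked = np.vstack((train_rows, extra_rows))

# ... encoding of the stacked array would happen here ...

# The appended rows start at index n_train, so slice them back out by position.
recovered_extra = stacked[n_train:]
x = stacked[:n_train]

# Standardize using statistics from the original rows only (ddof=0).
mean = x.mean(axis=0)
std = x.std(axis=0)
scaled_extra = (recovered_extra - mean) / std
print(scaled_extra)
```

Slicing with `stacked[n_train:]` works for any number of appended samples, which avoids hard-coding a single row index.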

What they do:

1. Use sc_x.
2. Look at the values, hard-code them, and scale them with transform.

Some of the values are similar, but others are not. Which one is correct?
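One thing that may be worth checking (a guess from reading the code, not a confirmed answer): a fitted scaler maps identical raw rows to identical scaled rows, so differing scaled values suggest the two raw rows differ. The code reads the sample from x[1000,:] but deletes row 10000. If the main CSV has 10000 rows (an assumption), the appended sample sits at index 10000, not 1000. The sketch below uses made-up data to show the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data standing in for 10000 encoded rows plus one appended sample.
main_rows = rng.normal(size=(10000, 3))
extra_row = np.array([[1.0, 2.0, 3.0]])
x = np.vstack((main_rows, extra_row))

appended = x[len(main_rows)]  # index 10000: the appended sample
other = x[1000]               # index 1000: a row from the main dataset

print(np.allclose(appended, extra_row[0]))
print(np.allclose(other, extra_row[0]))
```

Printing prepro_testx before scaling and comparing it with the hard-coded array would show whether this is what is happening.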

0 answers:

No answers