In this case, I have the 'Credit Card Transactions' dataset, which consists of 284,807 records labeled as either fraudulent or normal transactions (the variable 'Class' holds each transaction's label). I used TensorFlow in Python to implement a Multi-layer Perceptron (following an open-source tutorial I found online) in order to build a binary classifier that predicts whether a given transaction is a fraud case. To do this, I split the data into training and validation sets at a 70/30 ratio. After running the model for only 10 epochs, I got 99.8% accuracy! That seems like an implausible result to me. I also realized that the test set is not used during training and is only used at the end to calculate the accuracy. So can someone help me understand what the accuracy represents here, and how I can calculate the final precision of the model?
Below are my code and the results.
import sys
import os.path
import pandas as pd
import numpy as np
import tensorflow as tf
from datetime import datetime
from sklearn.model_selection import train_test_split
# Choose device from cmd line. Options: gpu or cpu
device_name = sys.argv[1]
if device_name == 'gpu':
    device_name = '/gpu:0'
else:
    device_name = '/cpu:0'
tf.device(device_name)
print('Device used:' + str(device_name))
datasetCSVPath = sys.argv[2]
parentPath = os.path.abspath(os.path.join(datasetCSVPath, os.pardir))
print(datasetCSVPath)
print(parentPath)
print('Reading dataset...')
df = pd.read_csv(datasetCSVPath, index_col=False)
print('Adding Class inverse column...')
clsInv = [ 1 - row.Class for index, row in df.iterrows() ]
df = df.assign(classInverse = clsInv)
print('')
print('Raw Data: \n------------------------------ \n')
print(df.head())
print('')
print('Whole dataset size: ' + str(len(df)))
print('')
print('Data Description\n------------------------------ \n')
print(df.describe())
print('')
X = pd.DataFrame(df,columns=['Time','V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12','V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24','V25','V26','V27','V28','Amount'])
print('Features Vector:\n------------------------------ \n')
print(X.head())
print('')
Y = pd.DataFrame(df,columns=['Class','classInverse'])
print('Class Vector:\n------------------------------ \n')
print(Y.head())
print('')
print('Splitting Features and Class vectors into training and validation sets by 70/30 ratio...')
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3)
print('Train features set size: ' + str(len(X_train)))
print('Train Class size: ' + str(len(Y_train)))
print('Test features size: ' + str(len(X_test)))
print('Test Class size: ' + str(len(Y_test)))
# Parameters
learning_rate = 0.001
training_epochs = 10
batch_size = 100
display_step = 1
# Network Parameters
n_hidden_1 = 10 # 1st layer number of features
n_hidden_2 = 10 # 2nd layer number of features
n_input = 30 # Number of feature
n_classes = 2 # Number of classes to predict
# tf Graph input
# x = tf.placeholder(tf.float32, [101, n_input])
# y = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])
# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Implement get_GPU_tensorflow_session method, to use CPU just return tf.Session() instead
def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
print('')
startTime = datetime.now()
print("Started at: ", startTime)
saver = tf.train.Saver()
# Launch the graph
with get_session() as sess:
    #sess.run(init)
    saver.restore(sess, parentPath + "/model.ckpt")
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(X_train)/batch_size)
        X_batches = np.array_split(X_train, total_batch)
        Y_batches = np.array_split(Y_train, total_batch)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = X_batches[i], Y_batches[i]
            #print(batch_x)
            #print(batch_y)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x,
                                                          y: batch_y})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost))
    save_path = saver.save(sess, parentPath + "/model.ckpt")
    print("Model saved in path: %s" % save_path)
    print("Optimization Finished!")
    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Accuracy:", accuracy.eval({x: X_test, y: Y_test}))
    print("Time taken:", datetime.now() - startTime)
    global result
    result = tf.argmax(pred, 1).eval({x: X_test, y: Y_test})
Here is the output:
C:\Users\MOHAMMAD>python C:\Users\MOHAMMAD\Documents\Fraud\CreditCard\Trained.py cpu C:\Users\MOHAMMAD\Documents\Fraud\CreditCard\CC_All.csv
Device used:/cpu:0
C:\Users\MOHAMMAD\Documents\Fraud\CreditCard\CC_All.csv
C:\Users\MOHAMMAD\Documents\Fraud\CreditCard
Reading dataset...
Adding Class inverse column...
Raw Data:
------------------------------
Unnamed: 0 Time V1 V2 V3 V4 V5 \
0 1 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321
1 2 0.0 1.191857 0.266151 0.166480 0.448154 0.060018
2 3 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198
3 4 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309
4 5 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193
V6 V7 V8 ... V22 V23 V24 \
0 0.462388 0.239599 0.098698 ... 0.277838 -0.110474 0.066928
1 -0.082361 -0.078803 0.085102 ... -0.638672 0.101288 -0.339846
2 1.800499 0.791461 0.247676 ... 0.771679 0.909412 -0.689281
3 1.247203 0.237609 0.377436 ... 0.005274 -0.190321 -1.175575
4 0.095921 0.592941 -0.270533 ... 0.798278 -0.137458 0.141267
V25 V26 V27 V28 Amount Class classInverse
0 0.128539 -0.189115 0.133558 -0.021053 149.62 0 1.0
1 0.167170 0.125895 -0.008983 0.014724 2.69 0 1.0
2 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0 1.0
3 0.647376 -0.221929 0.062723 0.061458 123.50 0 1.0
4 -0.206010 0.502292 0.219422 0.215153 69.99 0 1.0
[5 rows x 33 columns]
Whole dataset size: 284807
Data Description
------------------------------
Unnamed: 0 Time V1 V2 V3 \
count 284807.000000 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05
mean 142404.000000 94813.859575 1.165980e-15 3.416908e-16 -1.373150e-15
std 82216.843396 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00
min 1.000000 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01
25% 71202.500000 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01
50% 142404.000000 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01
75% 213605.500000 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00
max 284807.000000 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00
V4 V5 V6 V7 V8 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 2.086869e-15 9.604066e-16 1.490107e-15 -5.556467e-16 1.177556e-16
std 1.415869e+00 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00
min -5.683171e+00 -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01
25% -8.486401e-01 -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01
50% -1.984653e-02 -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02
75% 7.433413e-01 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01
max 1.687534e+01 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01
... V22 V23 V24 V25 \
count ... 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean ... -3.444850e-16 2.578648e-16 4.471968e-15 5.340915e-16
std ... 7.257016e-01 6.244603e-01 6.056471e-01 5.212781e-01
min ... -1.093314e+01 -4.480774e+01 -2.836627e+00 -1.029540e+01
25% ... -5.423504e-01 -1.618463e-01 -3.545861e-01 -3.171451e-01
50% ... 6.781943e-03 -1.119293e-02 4.097606e-02 1.659350e-02
75% ... 5.285536e-01 1.476421e-01 4.395266e-01 3.507156e-01
max ... 1.050309e+01 2.252841e+01 4.584549e+00 7.519589e+00
V26 V27 V28 Amount Class \
count 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000 284807.000000
mean 1.687098e-15 -3.666453e-16 -1.220404e-16 88.349619 0.001727
std 4.822270e-01 4.036325e-01 3.300833e-01 250.120109 0.041527
min -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000 0.000000
25% -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000 0.000000
50% -5.213911e-02 1.342146e-03 1.124383e-02 22.000000 0.000000
75% 2.409522e-01 9.104512e-02 7.827995e-02 77.165000 0.000000
max 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000 1.000000
classInverse
count 284807.000000
mean 0.998273
std 0.041527
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
[8 rows x 33 columns]
Features Vector:
------------------------------
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V20 V21 V22 V23 \
0 0.098698 0.363787 ... 0.251412 -0.018307 0.277838 -0.110474
1 0.085102 -0.255425 ... -0.069083 -0.225775 -0.638672 0.101288
2 0.247676 -1.514654 ... 0.524980 0.247998 0.771679 0.909412
3 0.377436 -1.387024 ... -0.208038 -0.108300 0.005274 -0.190321
4 -0.270533 0.817739 ... 0.408542 -0.009431 0.798278 -0.137458
V24 V25 V26 V27 V28 Amount
0 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62
1 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69
2 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66
3 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50
4 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99
[5 rows x 30 columns]
Class Vector:
------------------------------
Class classInverse
0 0 1.0
1 0 1.0
2 0 1.0
3 0 1.0
4 0 1.0
Splitting Features and Class vectors into training and validation sets by 70/30 ratio...
Train features set size: 199364
Train Class size: 199364
Test features size: 85443
Test Class size: 85443
Started at: 2018-05-09 16:24:37.251806
Epoch: 0001 cost= 27.350340114
Epoch: 0002 cost= 35.998613088
Epoch: 0003 cost= 37.397314369
Epoch: 0004 cost= 16.843922048
Epoch: 0005 cost= 34.638866356
Epoch: 0006 cost= 49.155772096
Epoch: 0007 cost= 37.163821420
Epoch: 0008 cost= 39.222774566
Epoch: 0009 cost= 35.772652394
Epoch: 0010 cost= 31.369274977
Model saved in path: C:\Users\MOHAMMAD\Documents\Fraud\CreditCard/model.ckpt
Optimization Finished!
Accuracy: 0.99854875
Time taken: 0:00:32.633586
Apologies in advance for any incorrect statements in this question, as I have only just started learning tensorflow. Thanks in advance!!
Answer 0 (score: 1)
You should first analyze the distribution of your dataset, because normal-vs-fraud datasets like this one are usually highly imbalanced, meaning they contain a disproportionate number of normal cases compared to fraud cases. For example, if only 1% of the dataset represents fraud cases, then even 99% accuracy cannot be considered good enough. In cases like this, precision and sensitivity are better measures of performance.
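As a concrete first step, here is a minimal sketch (assuming the same df DataFrame already loaded in your script) that inspects the label distribution. Your own describe() output already shows a Class mean of 0.001727, i.e. only about 0.17% of the records are fraud:

# Minimal sketch: check how imbalanced the labels are before trusting accuracy.
# Assumes `df` is the DataFrame already loaded in the question's script.
counts = df['Class'].value_counts()
print(counts)
fraud_ratio = counts[1] / len(df)
print('Fraud ratio: {:.4%}'.format(fraud_ratio))
# A degenerate model that always predicts "normal" already scores
# 1 - fraud_ratio accuracy, which is why 99.8% alone proves very little.
print('Baseline accuracy (always normal): {:.4%}'.format(1 - fraud_ratio))

With this distribution, your reported accuracy of 99.85% is barely above the always-normal baseline of roughly 99.83%.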
In your case, you have to identify all fraud cases correctly, even if that means classifying some normal cases as fraud. For such a requirement, sensitivity is a good metric (sensitivity basically represents how good the model is at detecting positives/fraud cases).
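Here is a minimal sketch of how precision and sensitivity (recall) could be computed at the end of your script, reusing the result array and Y_test that already exist there. One assumption to note: since Y has the column order ['Class', 'classInverse'], tf.argmax returns index 0 for fraud and 1 for normal, so 1 - result maps the predictions back onto the Class encoding:

# Minimal sketch, assuming it runs after `result` is computed in the script.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = Y_test['Class'].values  # 1 = fraud, 0 = normal
y_pred = 1 - result              # argmax index 0 (fraud) -> label 1, index 1 (normal) -> label 0

print(confusion_matrix(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))           # of flagged transactions, how many are really fraud
print('Recall (sensitivity):', recall_score(y_true, y_pred))   # of actual frauds, how many were caught

If the recall printed here is far below the 99.8% headline accuracy, the model is mostly riding on the majority class, which is exactly the effect described above.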