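For my research, I am trying to generate data from the breast_cancer dataset. In my code, I take a portion of the original data and train a one class svm on it. My goal is to generate synthetic data that is very similar to the original data, so that the one class svm cannot tell the original and the generated data apart.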
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
DATA_FILE = "breastcancer.txt"
adversary_percentages = [0.01, 0.02, 0.04, 0.08, 0.16]
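for adversary_percentage in adversary_percentages:
    # Load the data, drop rows with missing values, and drop the first column.
    data = pd.read_csv(DATA_FILE, sep=",")
    data = data.dropna(how='any', axis=0)
    data = data.drop(data.columns[0], axis=1)

    # Hold out a fraction of the rows as the data the one class svm gets to see.
    main_data, adversary_data = train_test_split(data, test_size=adversary_percentage, random_state=1234)

    # Scale the held-out rows to [0, 1] and fit the one class svm on them.
    scaler = preprocessing.MinMaxScaler()
    adversary_input = scaler.fit_transform(adversary_data)
    svc = svm.OneClassSVM(nu=0.01, gamma=0.1).fit(adversary_input)

    # "accuracy" here is the fraction of the training rows predicted as inliers (+1).
    predicted = svc.predict(adversary_input)
    accuracy = np.round(np.mean(predicted == 1), 2)
    print("percentage_of_data_one_class_has: " + str(adversary_percentage) + " accuracy: " + str(accuracy))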
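The goal stated at the top (synthetic data that the one class svm cannot tell apart from the original) could be checked roughly as in the sketch below. It is only an illustration under assumptions: scikit-learn's bundled `load_breast_cancer` data stands in for `breastcancer.txt`, the noise-based `synthetic` array is a hypothetical placeholder for whatever generator is actually used, and `inlier_fraction` is a helper name introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import OneClassSVM

# Assumption: the bundled dataset stands in for breastcancer.txt.
X = load_breast_cancer().data
main_data, adversary_data = train_test_split(X, test_size=0.16, random_state=1234)

# Same preprocessing and model settings as in the question.
scaler = MinMaxScaler()
adversary_input = scaler.fit_transform(adversary_data)
svc = OneClassSVM(nu=0.01, gamma=0.1).fit(adversary_input)

# Hypothetical placeholder generator: real rows plus small Gaussian noise.
rng = np.random.default_rng(0)
synthetic = adversary_input + rng.normal(scale=0.01, size=adversary_input.shape)


def inlier_fraction(model, rows):
    """Fraction of rows the model predicts as inliers (+1)."""
    return float(np.mean(model.predict(rows) == 1))


real_rate = inlier_fraction(svc, adversary_input)
synthetic_rate = inlier_fraction(svc, synthetic)
print("real inlier fraction:", round(real_rate, 2))
print("synthetic inlier fraction:", round(synthetic_rate, 2))
```

If the two fractions are close, the model treats the generated rows about the same as the rows it was trained on; a large gap would suggest the generated rows are distinguishable.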
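At the end of the code, I noticed an anomaly that I cannot explain. As the percentage of the original data that the one class svm has increases, fewer points are labeled as outliers and the accuracy increases. Can you help me understand this?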
percentage_of_data_one_class_has: 0.01 accuracy: 0.79
percentage_of_data_one_class_has: 0.02 accuracy: 0.78
percentage_of_data_one_class_has: 0.04 accuracy: 0.88
percentage_of_data_one_class_has: 0.08 accuracy: 0.94
percentage_of_data_one_class_has: 0.16 accuracy: 0.96
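One way to dig into these numbers is to print, for each percentage, how many rows the model was trained on and how many of them ended up as support vectors, next to the inlier fraction on the training rows and on the remaining rows. The sketch below is a rough diagnostic, not a definitive explanation; it assumes scikit-learn's bundled `load_breast_cancer` data in place of `breastcancer.txt`, so the exact values will likely differ from the output above. For reference, `nu` is documented as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import OneClassSVM

# Assumption: the bundled dataset stands in for breastcancer.txt.
X = load_breast_cancer().data

results = []
for pct in [0.01, 0.02, 0.04, 0.08, 0.16]:
    main_data, adversary_data = train_test_split(X, test_size=pct, random_state=1234)

    # Fit the scaler on the rows the model sees, then reuse it for the rest.
    scaler = MinMaxScaler().fit(adversary_data)
    adversary_input = scaler.transform(adversary_data)
    svc = OneClassSVM(nu=0.01, gamma=0.1).fit(adversary_input)

    n_train = len(adversary_input)
    n_sv = len(svc.support_)  # indices of the support vectors
    train_rate = float(np.mean(svc.predict(adversary_input) == 1))
    rest_rate = float(np.mean(svc.predict(scaler.transform(main_data)) == 1))

    results.append((pct, n_train, n_sv, train_rate, rest_rate))
    print(f"pct={pct} n_train={n_train} support_vectors={n_sv} "
          f"train_inliers={train_rate:.2f} rest_inliers={rest_rate:.2f}")
```

With only a handful of training rows at the smallest percentages, a single row flipping between inlier and outlier moves the reported fraction a lot, so it may help to compare the row counts with the fractions before reading too much into the trend.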