Question

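I'm trying to understand how to use scikit for supervised machine learning, so I have made up some data belonging to two different sets: set A and set B. I have 18 elements in set A and 18 elements in set B. Each element has three variables. See below: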

#SetA
Variable1A = [ 3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
Variable2A = [ 5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
Variable3A = [ 7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]


#SetB
Variable1B = [ 7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
Variable2B = [ 1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
Variable3B = [ 12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

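How can I use scikit for supervised machine learning so that, when I introduce new set A and set B data, it can try to identify which of the new data belongs to set A or set B?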

Apologies that the data sets are small and made up; I just want to apply the same approach to other data sets using scikit.

Answer 1

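Your question is quite broad, so this is only a brief overview. You don't want to format your data this way; instead, put both sets in a single list/array, with an extra column indicating which set each row belongs to. Like this: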

data = [
    [3, 5, 7, 0],
    [4, 4, 8, 0],  # these rows have 0 as the last element to represent group A
    ...
    [7, 1, 12, 1],
    [8, 2, 18, 1], # these have 1 as the last element to represent group B
    ...
]

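An alternative is to put only the first three columns in data and call it X, and then use a separate array y containing just [0, 0, 0, ..., 1, 1, 1, ...] (the group membership of each row). What you want to avoid is storing the information about which group a point belongs to in the names of your variables; you want the "set A or set B" information stored in the values of a variable (as it is in the last column of data, or in the values of y).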

Whatever you do, you almost certainly want to use numpy arrays or pandas data structures to hold your data, rather than lists.
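As a rough sketch of the X / y layout described above, the lists from the question can be stacked into NumPy arrays like this. The lowercase variable names and the choice of 0 for set A and 1 for set B are just illustrative assumptions.

```python
import numpy as np

# Raw lists copied from the question.
variable1A = [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4]
variable2A = [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3]
variable3A = [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8]
variable1B = [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11]
variable2B = [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3]
variable3B = [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18]

# One row per element, one column per variable.
X_a = np.column_stack((variable1A, variable2A, variable3A))
X_b = np.column_stack((variable1B, variable2B, variable3B))

# All 36 rows in one array: set A first, then set B.
X = np.vstack((X_a, X_b))

# Group membership per row, in the same order as X (assumed: 0 = set A, 1 = set B).
y = np.array([0] * len(X_a) + [1] * len(X_b))

print(X.shape, y.shape)
```

With this layout the set membership lives in the values of y, not in the variable names.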

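There are many tutorials and examples on how to use scikit-learn, along with example data sets that will probably be more useful than the one you made up. "Supervised machine learning" is a broad term covering many different ways of deciding which group a data point belongs to, so you will have to play around and try different classification algorithms. All of this information can be found by googling and/or browsing the scikit documentation.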
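One possible way to "play around" is to score a few classifiers with cross-validation on the same X and y. The sketch below is only an illustration: the three classifiers, their parameters (max_iter, n_neighbors, random_state) and the 3-fold split are assumptions chosen for brevity, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Data from the question: one row per element, set A first, then set B.
X = np.vstack((
    np.column_stack((
        [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
        [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
        [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
    )),
    np.column_stack((
        [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
        [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
        [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
    )),
))
y = np.array([0] * 18 + [1] * 18)  # assumed labels: 0 = set A, 1 = set B

# A few classifiers to compare; the selection is arbitrary.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, clf in candidates.items():
    # 3-fold cross-validation; each fold is scored on rows the model did not see.
    scores = cross_val_score(clf, X, y, cv=3)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

On a data set this small the scores say little about real-world performance; the point is only the pattern of swapping classifiers behind the same fit/predict interface.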

Answer 2

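I think this is a good question, and don't worry if you feel it is not clear enough. Supervised learning can be used to classify instances (rows of data) into several categories (or, in your case, just 2 sets). What is missing from your example above is a variable that states which set each row belongs to.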

import numpy as np # numpy will help us to concatenate the columns into a 2-dimensional array
# so instead of having 3 separate arrays, we have 1 array with 18 rows and 3 columns

Variable1A = [ 3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
Variable2A = [ 5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
Variable3A = [ 7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]

# our target variable for A: one label per row, so exactly 18 ones
target_variable_A=[1]*18

Variable1B = [ 7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
Variable2B = [ 1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
Variable3B = [ 12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

# target variable for B: one label per row, so exactly 18 zeros
target_variable_B=[0]*18

# let's create a dataset C with only 4 rows; for each row we want to predict whether it belongs to "1" (dataset A) or "0" (dataset B)

Variable1C = [ 7,4,4,12]
Variable2C = [ 1,4,4,3]
Variable3C = [ 12,8,4,15]

# make the objects 2-dimensional arrays (so 1 array with X rows and 3 columns, one per variable)
Dataset_A=np.column_stack((Variable1A,Variable2A,Variable3A))
Dataset_B=np.column_stack((Variable1B,Variable2B,Variable3B))
Dataset_C=np.column_stack((Variable1C,Variable2C,Variable3C))

print(" dataset A rows ", Dataset_A.shape[0]," dataset A columns ", Dataset_A.shape[1] )
print(" dataset B rows ", Dataset_B.shape[0]," dataset B columns ", Dataset_B.shape[1] )
print(" dataset C rows ", Dataset_C.shape[0]," dataset C columns ", Dataset_C.shape[1] )

##########Prints ##########
#(' dataset A rows ', 18L, ' dataset A columns ', 3L)
#(' dataset B rows ', 18L, ' dataset B columns ', 3L)
#(' dataset C rows ', 4L, ' dataset C columns ', 3L)

# since now we have an identification that tells us if it belongs to A or B (e.g. 1 or 0) we can append the new sets together
Dataset_AB=np.concatenate((Dataset_A,Dataset_B),axis=0) # this creates a set with 36 rows and 3 columns
target_variable_AB=np.concatenate((target_variable_A,target_variable_B),axis=0)

print(" dataset AB rows ", Dataset_AB.shape[0]," dataset Ab columns ", Dataset_AB.shape[1] )
print(" target Variable rows ", target_variable_AB.shape[0])

##########Prints ##########
#(' dataset AB rows ', 36L, ' dataset Ab columns ', 3L)
#(' target Variable rows ', 36L)

# now we will select a commonly used supervised scikit-learn model: Logistic Regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression() # we create an instance of the model

model.fit(Dataset_AB,target_variable_AB) # the model learns to distinguish between A and B (1 or 0)

#now we make predictions for the new dataset C

predictions_for_C=model.predict(Dataset_C)
print(predictions_for_C)
# this will print
#[0 1 1 0]
# since 1 means set A and 0 means set B: the first case belongs to set B, the second to A, the third to A and the fourth to B
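To double-check which label maps to which set, it can help to look at model.classes_ and at the predicted probabilities next to the predicted labels. The following is a minimal sketch that rebuilds the same data with 1 = set A and 0 = set B as in the code above; the set_names dictionary and max_iter=1000 are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as above: 18 rows for set A followed by 18 rows for set B.
dataset_a = np.column_stack((
    [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
))
dataset_b = np.column_stack((
    [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
))
X = np.vstack((dataset_a, dataset_b))
y = np.array([1] * 18 + [0] * 18)  # 1 = set A, 0 = set B

model = LogisticRegression(max_iter=1000).fit(X, y)

# The four rows of dataset C from the answer.
dataset_c = np.array([[7, 1, 12], [4, 4, 8], [4, 4, 4], [12, 3, 15]])

labels = model.predict(dataset_c)
probabilities = model.predict_proba(dataset_c)

# Hypothetical helper mapping for readable output.
set_names = {1: "set A", 0: "set B"}

# classes_ gives the column order of predict_proba.
print("column order of probabilities:", model.classes_)
for row, label, proba in zip(dataset_c, labels, probabilities):
    print(row, "->", set_names[int(label)], proba.round(3))
```

Each row of probabilities sums to 1, and the column order follows model.classes_, which removes any guesswork about what a predicted 0 or 1 stands for.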

Answer 3

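Supervised learning means that the data you provide to train the model is labelled, i.e. the outcome for each sample used for training is known in advance.

The problem posed has essentially 2 sets, set A and set B, so you will have to use a binary classifier such as a logistic regression model.

First label the elements of sets A and B as 1 or 0 according to the set they belong to (or vice versa); that is, if an element e belongs to set A, label it 1, otherwise label it 0.

Then import the logistic regression classifier from scikit-learn in Python.

The next thing is to merge the two sets (set A followed by set B, or vice versa) and to merge the labels you have assigned in the same order.

You can use pandas or numpy to stack up these sets and prepare the labelled dataset.

Now you have a well-labelled dataset.

You can now call the fit function of the logistic regression classifier with the dataset (containing the elements of set A and set B) and the set of labels.

After that, calling the predict function with the data you want to test gives you a predicted class of 0 or 1.

If you need the name of the set, you can use a dictionary that maps the keys 1 and 0 to the values "set A" and "set B", so you can look up the set from the prediction.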

import pandas as pd
import numpy as np 
from sklearn.linear_model import LogisticRegression as lr

#set A

firstA=[3,4,4,5,4,5,5,6,7,7,5,4,5,6,4,9,3,4]
secondA=[5,4,4,3,4,5,4,5,4,3,4,5,3,4,3,4,4,3]
thirdA=[7,8,4,5,6,7,3,3,3,4,4,9,7,6,8,6,7,8]

#set B

firstB=[7,8,11,12,7,9,8,7,8,11,15,9,7,6,9,9,7,11]
secondB=[1,2,3,3,4,2,4,1,0,1,2,1,3,4,3,1,2,3]
thirdB=[12,18,14,15,16,17,13,13,13,14,14,19,17,16,18,16,17,18]

#stacking up and building the dataset
#note: each row here is one variable list, so the model sees 6 samples with 18 features each

Aset=[firstA,secondA,thirdA]
Bset=[firstB,secondB,thirdB]

data=pd.DataFrame(Aset+Bset) # 6 rows (3 from set A, then 3 from set B) and 18 columns
label=np.array([0,0,0,1,1,1]) # 0 for the set A rows, 1 for the set B rows


#Training and testing the model

model=lr()
model.fit(data,label)
k=model.predict([[17,18,14,15,16,17,13,
13,13,41,14,19,17,16,18,16,17,28]])

#mapping(chosen set A element's with label 0 and set B with 1)

dic={0:"set A",1:"set B"}
print(dic[int(k[0])]) # k is an array with one prediction, so take its first entry
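For comparison, the steps described in this answer (label, merge in the same order, fit, predict, map back to a set name) can also be sketched with one row per element, so that each sample has three features. This is only an illustrative variant: the column names, the new_elements rows and max_iter=1000 are made up for the example, and it uses 1 = set A and 0 = set B as in the text of the answer.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per element, one column per variable.
set_a = pd.DataFrame({
    "first": [3, 4, 4, 5, 4, 5, 5, 6, 7, 7, 5, 4, 5, 6, 4, 9, 3, 4],
    "second": [5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 5, 3, 4, 3, 4, 4, 3],
    "third": [7, 8, 4, 5, 6, 7, 3, 3, 3, 4, 4, 9, 7, 6, 8, 6, 7, 8],
})
set_b = pd.DataFrame({
    "first": [7, 8, 11, 12, 7, 9, 8, 7, 8, 11, 15, 9, 7, 6, 9, 9, 7, 11],
    "second": [1, 2, 3, 3, 4, 2, 4, 1, 0, 1, 2, 1, 3, 4, 3, 1, 2, 3],
    "third": [12, 18, 14, 15, 16, 17, 13, 13, 13, 14, 14, 19, 17, 16, 18, 16, 17, 18],
})

# Label every element with its set (assumed: 1 = set A, 0 = set B).
set_a["label"] = 1
set_b["label"] = 0

# Merge set A followed by set B; the labels travel with their rows.
labelled = pd.concat([set_a, set_b], ignore_index=True)
feature_columns = ["first", "second", "third"]

model = LogisticRegression(max_iter=1000)
model.fit(labelled[feature_columns], labelled["label"])

# Two made-up elements to classify (hypothetical test data).
new_elements = pd.DataFrame({"first": [4, 11], "second": [4, 2], "third": [6, 16]})

# Map the predicted 0/1 classes back to set names with a dictionary.
set_names = {1: "set A", 0: "set B"}
predicted = [set_names[int(label)] for label in model.predict(new_elements)]
print(predicted)
```

Keeping the label as a column until the last moment means the rows and labels cannot get out of order when the sets are concatenated.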
