A single scikit-learn / pandas algorithm for data with a categorical variable

Time: 2018-11-24 21:50:31

Tags: python pandas scikit-learn

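I am interested in a programming solution to a conceptual question I asked on Data Science Stack Exchange. Judging by the replies (https://datascience.stackexchange.com/questions/41606/single-machine-learning-algorithm-for-multiple-classes-of-data-one-hot-encoder), there does not seem to be a simple algorithm. So I am wondering what the best way to program this is.

While still using one machine learning model and one data frame, how can I use pandas and scikit-learn on the combined data and get the same accuracy as with the separated data? Is splitting the data and building separate models the only way to program this in pandas and scikit-learn and get the best accuracy?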

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Dataframe with x1 = 0 and linear regression gives a slope of 1 as expected

df = pd.DataFrame(data=[{'x1': 0, 'x2': 1, 'y': 1},
                        {'x1': 0, 'x2': 2, 'y': 2},
                        {'x1': 0, 'x2': 3, 'y': 3},
                        {'x1': 0, 'x2': 4, 'y': 4}
                        ],
                  columns=['x1', 'x2', 'y'])

X = df[['x1', 'x2']]
y = df['y']
reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[0, 5]]))) # Output is 5 as expected

# Dataframe with x1 = 1 and linear regression gives a slope of 4 as expected

df = pd.DataFrame(data=[{'x1': 1, 'x2': 1, 'y': 4},
                        {'x1': 1, 'x2': 2, 'y': 8},
                        {'x1': 1, 'x2': 3, 'y': 12},
                        {'x1': 1, 'x2': 4, 'y': 16}
                        ],
                  columns=['x1', 'x2', 'y'])

X = df[['x1', 'x2']]
y = df['y']
reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[1, 5]]))) # Output is 20 as expected 

# Combine the two data frames x1 = 0 and x1 = 1 

df = pd.DataFrame(data=[{'x1': 0, 'x2': 1, 'y': 1},
                        {'x1': 0, 'x2': 2, 'y': 2},
                        {'x1': 0, 'x2': 3, 'y': 3},
                        {'x1': 0, 'x2': 4, 'y': 4},
                        {'x1': 1, 'x2': 1, 'y': 4},
                        {'x1': 1, 'x2': 2, 'y': 8},
                        {'x1': 1, 'x2': 3, 'y': 12},
                        {'x1': 1, 'x2': 4, 'y': 16}
                        ],
                  columns=['x1', 'x2', 'y'])

X = df[['x1', 'x2']]
y = df['y']
reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[0, 5]]))) # Output is 8.75 while the optimal solution is 5
print(reg.predict(np.array([[1, 5]]))) # Output is 16.25 while the optimal solution is 20

# One-hot encode x1 and fit again on the combined data

df = pd.get_dummies(df, columns=["x1"], prefix=["x1"])
X = df[['x1_0', 'x1_1', 'x2']]
y = df['y']
reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[1, 0, 5]]))) # Output is 8.75 while the optimal solution is 5
print(reg.predict(np.array([[0, 1, 5]]))) # Output is 16.25 while the optimal solution is 20
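
One direction that may be worth trying, offered as a sketch and not a confirmed answer: the fits above can only shift the line up or down depending on x1, while the data needs a different slope on x2 for each value of x1. Adding the product of x1 and x2 as an extra column gives a single linear model that freedom. The column name x1_x2 and the query frame are assumptions made for illustration; they are not part of the original code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same combined data as above: y = x2 when x1 == 0, y = 4 * x2 when x1 == 1
df = pd.DataFrame({
    'x1': [0, 0, 0, 0, 1, 1, 1, 1],
    'x2': [1, 2, 3, 4, 1, 2, 3, 4],
    'y':  [1, 2, 3, 4, 4, 8, 12, 16],
})

# Hypothetical extra column (an assumption, not in the original post):
# the product lets the slope on x2 differ between the two x1 groups
df['x1_x2'] = df['x1'] * df['x2']

features = ['x1', 'x2', 'x1_x2']
reg = LinearRegression().fit(df[features], df['y'])

# Query points need the same interaction column, built the same way
query = pd.DataFrame({'x1': [0, 1], 'x2': [5, 5]})
query['x1_x2'] = query['x1'] * query['x2']

pred = reg.predict(query[features])
print(pred)  # should be close to 5 and 20, up to floating-point error
```

This still uses one model and one data frame. Whether it generalizes to a categorical variable with many levels (one product column per one-hot column) or to non-linear models is an open question here.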

0 answers:

No answers