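I am trying to read the wage dataset Wages.csv and then split it into separate columns, but I get an exception saying that the data must be 1-dimensional.
The code is copied below, and a link to the dataset is given.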
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# read data_set
data = pd.read_csv("Wage.csv")
data.head()
data_x = data['age']
data_y = data['wage']
# Dividing data into train and validation datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state = 1)
# Dividing the data into 4 bins
df_cut, bins = pd.cut(train_x, 4, retbins=True, right=True)
df_cut.value_counts(sort=False)
df_steps = pd.concat([train_x, df_cut, train_y], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
df_steps_dummies.head()
df_steps_dummies.columns = ['17.938-33.5','33.5-49','49-64.5','64.5-80']
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 4 bins
bin_mapping = np.digitize(valid_x, bins)
X_valid = pd.get_dummies(bin_mapping)
I get the following exception: Exception: Data must be 1-dimensional
Answer 0: (score: 0)
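If you look at the data returned by np.digitize, it is shaped like this: [1] [2] ... [3]
You need to get something like [1 2 ... 3] instead.
To do that, flatten the data into a single list and then put it back into an np.array.
For example:
Code: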
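def binMapping(x):
    # Assumes np.digitize returns a 2-D array (one single-element row per sample)
    flat = []
    prestep = np.digitize(x, bins)
    for sublist in prestep:
        for ele in sublist:
            flat.append(ele)
    # Return the flattened bin indices as a 1-D array
    return np.array(flat)

bin_mapping = binMapping(valid_x)
X_valid = pd.get_dummies(bin_mapping)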
This works. I am sure there is a better way.
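One possibly simpler way to do the same flattening is np.ravel, which collapses the 2-D output of np.digitize into a 1-D array in a single call. The sketch below is only an illustration: `valid_x` and `bins` are made-up stand-ins with the same shapes as in the question, not the real Wage data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's variables (not the real Wage data).
bins = np.array([17.938, 33.5, 49.0, 64.5, 80.0])   # bin edges, like those from pd.cut(..., retbins=True)
valid_x = np.array([[25], [40], [58], [70], [33]])  # 2-D column of ages, shape (5, 1)

# np.digitize preserves the input shape, so the result is still 2-D here.
bin_mapping_2d = np.digitize(valid_x, bins)

# np.ravel flattens it to 1-D, which pd.get_dummies can turn into dummy columns.
bin_mapping = np.ravel(bin_mapping_2d)
X_valid = pd.get_dummies(bin_mapping)

print(bin_mapping_2d.shape)  # shape before flattening
print(bin_mapping.shape)     # shape after flattening
print(X_valid.head())
```

Note that pd.cut(..., right=True) builds right-closed intervals, while np.digitize defaults to right=False. If the validation bins need to match the training bins exactly at the edges, passing right=True to np.digitize may be worth trying.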