Mean (likelihood) encoding

Time: 2019-08-19 10:38:35

Tags: python machine-learning scikit-learn data-science

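I have a dataset called "data" with categorical values that I would like to encode with mean (likelihood/target) encoding rather than label encoding.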

My dataset looks like this:

data.head()

   ID  X0  X1  X10  X100  X101  X102  X103  X104  X105  ...  X90  X91  X92  X93  X94  X95  X96  X97  X98  X99
0   0   k   v    0     0     0     0     0     0     0  ...    0    0    0    0    0    0    0    0    0    0
1   6   k   t    0     1     1     0     0     0     0  ...    0    0    0    0    0    0    1    0    1    0
2   7  az   w    0     0     1     0     0     0     0  ...    0    0    0    0    0    0    1    0    1    0
3   9  az   t    0     0     1     0     0     0     0  ...    0    0    0    0    0    0    1    0    1    0
4  13  az   v    0     0     1     0     0     0     0  ...    0    0    0    0    0    0    1    0    1    0
5 rows × 377 columns

I tried:

# Select categorical features
cat_features = data.dtypes == 'object'

# Define function
def mean_encoding(df, cols, target):

    for c in cols:
        means = df.groupby(c)[target].mean()
        df[c].map(means)

    return df

# Encode
data = mean_encoding(data, cat_features, target)

It raises:

KeyError: False

I also tried:

# Define function
def mean_encoding(df, target):

    for c in df.columns:
        if df[c].dtype == 'object':
            means = df.groupby(c)[target].mean()
            df[c].map(means)

    return df

It raises:

KeyError: 'Columns not found: 87.68, 87.43, 94.38, 72.11, 73.7, 74.0, 74.28, 76.26, ...

I concatenated the training and test datasets into a single one called "data", and saved the training target before dropping it from the dataset.

Any help would be appreciated. Thanks.

1 answer:

Answer 0 (score: 1)

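I think you are not selecting the categorical columns correctly. cat_features = data.dtypes == 'object' does not give you the column names; it gives you a boolean Series that says, for each column, whether its dtype is object. Iterating over that Series yields True/False values instead of names, so the first df.groupby(c) call is effectively df.groupby(False), which raises KeyError: False.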
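To make the difference concrete, here is a minimal sketch on a small hypothetical DataFrame (toy values standing in for the asker's data, with X0 explicitly created as object dtype). It compares what iterating over the boolean mask yields with the column names that can be derived from that same mask.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data` (not the asker's real values)
data = pd.DataFrame({
    "ID": [0, 6, 7],
    "X0": pd.Series(["k", "k", "az"], dtype="object"),
    "X10": [0, 0, 0],
})

# Boolean Series indexed by column name: one True/False per column
cat_mask = data.dtypes == "object"

# Iterating over a Series yields its values (booleans), not the column names
iterated = [bool(v) for v in cat_mask]

# Indexing the mask with itself keeps only the True entries;
# the index of the result holds the categorical column names
cat_features = list(cat_mask[cat_mask].index)
```

Here iterated contains only booleans, while cat_features contains the names that groupby can actually use.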

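You can select the categorical columns in either of these ways:

# Option 1: all columns minus the numeric ones
mycolumns = data.columns
numerical_columns = data._get_numeric_data().columns
cat_features = list(set(mycolumns) - set(numerical_columns))

# Option 2: select the object-dtype columns directly
cat_features = data.select_dtypes(['object']).columns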

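The rest of the code stays essentially the same; just note that map returns a new Series, so its result has to be assigned back to the column:

# Define function
# target is expected to be the name of the target column in df
def mean_encoding(df, cols, target):

    for c in cols:
        # Mean of the target for each category of column c
        means = df.groupby(c)[target].mean()
        # Replace each category with its mean target value
        df[c] = df[c].map(means)

    return df

# Encode
data = mean_encoding(data, cat_features, target)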
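The second error in the question (Columns not found: 87.68, ...) suggests that target holds the target values themselves rather than a column name, which is what df.groupby(c)[target] expects. The following is a minimal sketch for that situation, not part of the original answer. It assumes the training rows come first in the combined data and that target is a Series with one value per training row. The function name mean_encode, the n_train parameter, the toy values, and filling unseen categories with the global mean are all illustrative choices.

```python
import pandas as pd

def mean_encode(df, cols, target, n_train):
    """Return a copy of df where each column in cols is replaced by the
    per-category mean of target, computed on the first n_train rows only."""
    out = df.copy()
    train = df.iloc[:n_train]
    # Re-index the target so it lines up with the training rows
    y = pd.Series(target.to_numpy(), index=train.index)
    global_mean = y.mean()
    for c in cols:
        # Mean target per category, using training rows only
        means = y.groupby(train[c]).mean()
        # Categories never seen in training map to NaN; use the global mean instead
        out[c] = df[c].map(means).fillna(global_mean)
    return out

# Hypothetical toy data: 4 training rows followed by 2 test rows
data = pd.DataFrame({
    "ID": [0, 6, 7, 9, 13, 18],
    "X0": pd.Series(["k", "k", "az", "az", "k", "t"], dtype="object"),
})
target = pd.Series([80.0, 90.0, 70.0, 74.0])

cat_features = data.select_dtypes(["object"]).columns
encoded = mean_encode(data, cat_features, target, n_train=len(target))
```

Computing the means on the training rows only means the test rows, which have no target, are encoded with the training statistics. The function works on a copy, so the original data is left unchanged.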