One-hot encoding labels while providing the input labels

Date: 2019-12-09 15:26:23

Tags: python pandas one-hot-encoding

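I am trying to one-hot encode a column of a pandas DataFrame, but I cannot find a way to pass the categories as a parameter. My idea is to have a fixed correspondence between the categories and the encoding, for example: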

CATEGORIES = ['A','B','C']
Y = pd.get_dummies(data['Article_Topic_1']).values

Y returns, for example, [0,0,1] for category A, whereas I want category A to be [1,0,0].

If that is not possible, is there a way to keep the encoding and find out exactly which string each position corresponds to?

2 answers:

Answer 0: (score: 0)

Maybe you could try doing the encoding with scikit-learn? https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html You can find a comprehensive example here: https://www.ritchieng.com/machinelearning-one-hot-encoding/
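As a rough sketch of what that could look like: OneHotEncoder accepts a `categories` argument (one list per input column), and the output columns follow the order of that list. The sample DataFrame below is made up for illustration and only stands in for the asker's `data`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

CATEGORIES = ["A", "B", "C"]

# Hypothetical sample data standing in for the asker's DataFrame.
data = pd.DataFrame({"Article_Topic_1": ["C", "A", "B", "A"]})

# One list of categories per input column; output columns follow this order.
encoder = OneHotEncoder(categories=[CATEGORIES])

# fit_transform expects 2-D input, hence the double brackets.
# The default result is a sparse matrix, so convert it to a dense array.
Y = encoder.fit_transform(data[["Article_Topic_1"]]).toarray()

# The fitted encoder remembers which string belongs to which position ...
labels = list(encoder.categories_[0])

# ... and can map one-hot rows back to the original strings.
recovered = list(encoder.inverse_transform(Y).ravel())

print(Y)
print(labels)
print(recovered)
```

With this, category A is encoded as [1, 0, 0] because it comes first in CATEGORIES, and `categories_` / `inverse_transform` cover the "which string is which" part of the question.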

Answer 1: (score: 0)

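I don't think you can do this directly with get_dummies(). But how about simply reorganizing the result? If I understand your question correctly, you want to reorder the columns of the one-hot encoded data to match a prescribed order.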

categories = ["A", "B", "C"]
Y = pd.get_dummies(data["Article_Topic_1"])
Y = Y[categories].values
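Note that `Y[categories]` raises a KeyError if one of the listed categories never occurs in the column. A possible variant, assuming an all-zero column is acceptable for such categories, is to use `reindex` instead of plain column selection. The sample data is hypothetical, and `dtype=int` is passed only so the result is 0/1 rather than booleans on newer pandas versions.

```python
import pandas as pd

CATEGORIES = ["A", "B", "C"]

# Hypothetical sample data; note that "B" never occurs.
data = pd.DataFrame({"Article_Topic_1": ["C", "A", "A"]})

dummies = pd.get_dummies(data["Article_Topic_1"], dtype=int)

# reindex puts the columns in the requested order and adds an
# all-zero column for every category that is missing from the data.
dummies = dummies.reindex(columns=CATEGORIES, fill_value=0)
Y = dummies.values

print(Y)
```

Be aware that `reindex` also silently drops any column that is not listed in CATEGORIES, so values outside the list end up as all-zero rows.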

Here is a function that checks some of the assumptions that make this solution work.

import pandas as pd

def get_dummies_for_coding(series, ordering):
    # It's easier to work with a Series here: for DataFrames,
    # pd.get_dummies() adds string prefixes to the column names,
    # which makes things a bit more complicated.
    assert isinstance(series, pd.Series)
    # Ordering must contain only values present in series.
    assert len(set(ordering) - set(series.unique())) == 0
    dummies = pd.get_dummies(series)
    # Select the columns in the requested order.
    dummies = dummies[ordering]
    # Return `dummies` itself instead to keep the column labels.
    return dummies.values

# Example
df = pd.DataFrame([["a", "foo"],
                   ["a", "bar"],
                   ["b", "bar"],
                   ["a", "baz"],
                   ["b", "bar"]],
                  columns=["colA", "colB"])
orderingA = ["b", "a"]
orderingB = ["baz", "bar", "foo"]

ret = get_dummies_for_coding(df["colA"], orderingA)
print(ret)
ret = get_dummies_for_coding(df["colB"], orderingB)
print(ret)
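As for the second part of the question (keeping the default encoding and knowing which string is which): one option is to hold on to the column labels of the get_dummies() result before converting it to an array. The following is only a sketch with made-up sample data; `dtype=int` is an optional choice to get 0/1 values.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
data = pd.DataFrame({"Article_Topic_1": ["C", "A", "B", "A"]})

dummies = pd.get_dummies(data["Article_Topic_1"], dtype=int)

# Position i of every one-hot vector stands for labels[i].
labels = list(dummies.columns)
Y = dummies.values

# Decode one-hot rows back to strings via the position of the 1.
decoded = [labels[i] for i in Y.argmax(axis=1)]

print(labels)
print(decoded)
```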