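I am trying to one-hot encode a column of a pandas DataFrame, but I cannot find a way to pass a categories parameter. My idea is to have a fixed correspondence between the categories and the encoding, for example: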
CATEGORIES = ['A','B','C']
Y = pd.get_dummies(data['Article_Topic_1']).values
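Y returns, for example, [0,0,1] for category A, whereas I want category A to be [1,0,0].
If that is not possible, is there a way to keep the encoding and know exactly which string each column corresponds to?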
Answer 0 (score: 0)
Maybe you could try encoding with scikit-learn instead? https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html You can find a comprehensive example here: https://www.ritchieng.com/machinelearning-one-hot-encoding/
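As a rough illustration of the scikit-learn route, the sketch below passes the desired order through the `categories` parameter of `OneHotEncoder`. The sample values in `topics` are made up and only stand in for `data['Article_Topic_1']`; this is one possible way to apply the suggestion, not necessarily what the answer had in mind.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

CATEGORIES = ["A", "B", "C"]

# Hypothetical sample standing in for data['Article_Topic_1'].
# The encoder expects a 2-D input with one column per feature.
topics = np.array(["C", "A", "B", "A"]).reshape(-1, 1)

# One list of categories per input column; the list order sets the column order.
encoder = OneHotEncoder(categories=[CATEGORIES])
Y = encoder.fit_transform(topics).toarray()  # densify the sparse result

print(Y)

# Column i of Y corresponds to encoder.categories_[0][i],
# so the original strings can be recovered from the encoding.
decoded = encoder.inverse_transform(Y).ravel()
print(decoded)
```

Keeping the fitted `encoder` around also covers the second part of the question, since `inverse_transform` maps encoded rows back to their strings.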
Answer 1 (score: 0)
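I don't think you can do this directly with get_dummies(). But what about simply reorganizing the result? If I understand your question correctly, you want to reorder the columns of the one-hot encoded data to match a prescribed ordering.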
categories = ["A", "B", "C"]
Y = pd.get_dummies(data["Article_Topic_1"])
Y = Y[categories].values
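For the second part of the question (knowing which string each position stands for), the column labels of the reordered frame already carry that information. A minimal sketch of the reordering, assuming a made-up sample in place of `data["Article_Topic_1"]`; the names `index_to_label` and `decoded` are only illustrative:

```python
import pandas as pd

categories = ["A", "B", "C"]

# Hypothetical sample standing in for data["Article_Topic_1"].
topics = pd.Series(["C", "A", "B", "A"])

# dtype=int gives 0/1 values; recent pandas versions default to booleans.
dummies = pd.get_dummies(topics, dtype=int)[categories]
Y = dummies.values

# Position i in each encoded row stands for dummies.columns[i].
index_to_label = dict(enumerate(dummies.columns))

# Decode by picking the label of the single hot position in every row.
decoded = dummies.columns[Y.argmax(axis=1)].tolist()

print(Y)
print(index_to_label)
print(decoded)
```

Holding on to the labelled DataFrame until the last moment, and calling `.values` only when the bare array is needed, avoids losing the link between columns and strings.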
The function below checks some of the assumptions that the column-reordering approach relies on.
import pandas as pd

def get_dummies_for_coding(series, ordering):
    # It's easier to work with a Series here, because pd.get_dummies()
    # adds string prefixes to the column names for DataFrames, which
    # makes things a bit more complicated.
    assert isinstance(series, pd.Series)
    # Ordering must contain only values present in series.
    assert set(ordering) <= set(series.unique())
    dummies = pd.get_dummies(series)
    # Select the columns in the requested order.
    dummies = dummies[ordering]
    # Return `dummies` itself instead to keep the column labels.
    return dummies.values
# Example
df = pd.DataFrame([["a", "foo"],
                   ["a", "bar"],
                   ["b", "bar"],
                   ["a", "baz"],
                   ["b", "bar"]],
                  columns=["colA", "colB"])
orderingA = ["b", "a"]
orderingB = ["baz", "bar", "foo"]
ret = get_dummies_for_coding(df["colA"], orderingA)
print(ret)
ret = get_dummies_for_coding(df["colB"], orderingB)
print(ret)
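One limitation of reordering is that it fails (with a KeyError, or with the assertion above) when a prescribed category never occurs in the data. A possible alternative, sketched here under the assumption that a categorical dtype is acceptable for the column, is to declare the categories before encoding; `get_dummies` then emits one column per declared category, in the declared order. The sample data is made up.

```python
import pandas as pd

categories = ["A", "B", "C"]

# Hypothetical sample; note that "B" never occurs in it.
topics = pd.Series(["C", "A", "A"])

# Declaring the categories up front fixes both the set and the order of columns.
typed = topics.astype(pd.CategoricalDtype(categories=categories))
dummies = pd.get_dummies(typed, dtype=int)

print(dummies.columns.tolist())
print(dummies.values)
```

The column for "B" is still present and simply contains zeros, so the encoding keeps the same width regardless of which values appear in a given batch.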