One-hot encoding for age category data

Time: 2019-02-24 02:36:18

Tags: python-3.x sklearn-pandas one-hot-encoding

When I try to encode the following categories with a one-hot encoder, I get a "couldn't convert string to float" error.

['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']

1 answer:

Answer 0 (score: 0)

I put together something quick that should work. The commented-out one-liner preprocesses your limit strings into numbers; however, it is much easier to write the limits directly as integers.

Essentially, this iterates through the sorted list of upper limits and compares each sample against them. The first limit that the sample is less than or equal to gets a 1 at its index, and we break.

import random

# str_limits = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
#
# one-line preprocessing of the limit strings: take each range's upper bound as an int
# limits = sorted(int(s.split("-")[-1]) for s in str_limits if not s.endswith("+"))
# limits.append(1000)  # catch-all upper bound for the '55+' group

# do this instead
limits = sorted([17, 35, 50, 55, 45, 25, 1000])

# sample 100 random ages from 0 to 64 for testing
samples = [random.randrange(65) for _ in range(100)]

onehot = []  # this is where we will store our one-hot encodings
for sample in samples:
    row = [0]*len(limits)  # preallocating a list
    for i, limit in enumerate(limits):
        if sample <= limit:
            row[i] = 1
            break

    # storing that sample's onehot into a onehot list of lists
    onehot.append(row)

for i in range(10):
    print("{}: {}".format(onehot[i], samples[i]))
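The inner loop can also be written as a binary search over the sorted limits. This is only a minimal sketch under the same assumptions as the code above (sorted upper bounds, with 1000 standing in for the '55+' group); the helper name age_to_onehot is hypothetical and not part of any library.

```python
from bisect import bisect_left

# Upper bounds of the age bins, sorted ascending; 1000 stands in for '55+'
limits = [17, 25, 35, 45, 50, 55, 1000]


def age_to_onehot(age, limits):
    """Return a one-hot list marking the first bin whose upper bound is >= age."""
    row = [0] * len(limits)
    # bisect_left returns the first index i with limits[i] >= age
    idx = bisect_left(limits, age)
    if idx == len(limits):
        raise ValueError("age {} exceeds the largest limit".format(age))
    row[idx] = 1
    return row


print(age_to_onehot(17, limits))  # [1, 0, 0, 0, 0, 0, 0]
print(age_to_onehot(18, limits))  # [0, 1, 0, 0, 0, 0, 0]
print(age_to_onehot(60, limits))  # [0, 0, 0, 0, 0, 0, 1]
```

Column i of each row corresponds to the i-th entry of the sorted limits, so the columns are ordered by age group rather than by the order of the original label list.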

I am not sure about the specifics of your implementation, but the error suggests strings are being passed where numbers are expected, so you are probably forgetting to convert from a string to an integer at some point.
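If your data already holds the category labels as strings (for example '26-35') rather than raw ages, you may not need a numeric conversion at all. Here is a rough sketch that maps each label straight to a one-hot row using only the standard library; the helper names lower_bound and label_to_onehot are made up for illustration, and the label list is the one from the question.

```python
labels = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']


def lower_bound(label):
    """Numeric lower bound of a label such as '26-35' or '55+'."""
    return int(label.rstrip("+").split("-")[0])


# Order the categories by age so that column i is the i-th age group
categories = sorted(labels, key=lower_bound)
index = {label: i for i, label in enumerate(categories)}


def label_to_onehot(label):
    """Return a one-hot list with a 1 in the column of the given label."""
    row = [0] * len(categories)
    row[index[label]] = 1  # raises KeyError for an unknown label
    return row


print(categories)
print(label_to_onehot('26-35'))  # [0, 0, 1, 0, 0, 0, 0]
print(label_to_onehot('55+'))    # [0, 0, 0, 0, 0, 0, 1]
```

If you are working with a pandas DataFrame, pandas.get_dummies also accepts string columns directly, which may be simpler than either approach here.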