Question

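I have a .csv file containing data, and I want to convert some of its columns to one-hot encoding. The problem occurs in the second-to-last line, where the one-hot index (e.g. for the first feature) is set in all rows instead of only the row I am currently on. Something seems to be wrong with how I access the 2D list... any suggestions? Thanks.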

def one_hot_encode(data_list, column):
    one_hot_list = [[]]
    different_elements = []

    for row in data_list[1:]:                  # count different elements
        if row[column] not in different_elements:
            different_elements.append(row[column])

    for i in range(len(different_elements)):   # set variable names
        one_hot_list[0].append(different_elements[i])

    vector = []                              # create list shape with zeroes
    for i in range(len(different_elements)):
        vector.append(0)
    for i in range(1460):
        one_hot_list.append(vector)

    ind_row = 1                                # encode 1 for each sample
    for row in data_list[1:]:
        index = different_elements.index(row[column])
        one_hot_list[ind_row][index] = 1     # mistake!! sets all rows to 1
        ind_row += 1

Answer 1

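Your problem stems from the vector object you create to hold the one-hot encoding: you create a single object and then build a one_hot_list that contains 1460 references to that same object. When you change one of the rows, the change shows up in all of them.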

The quick fix is to create a separate copy of vector for each row (see How to clone or copy a list?):

one_hot_list.append(vector[:])
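A minimal sketch of the difference, assuming a hypothetical three-element vector and three rows instead of the 1460 in the question. Appending vector itself stores the same object three times, while appending vector[:] stores three independent shallow copies:

```python
vector = [0, 0, 0]

# Build both lists while vector still holds only zeroes.
shared = []
copied = []
for _ in range(3):
    shared.append(vector)      # same object appended each time
    copied.append(vector[:])   # new shallow copy appended each time

copied[0][1] = 1   # changes only the first copy
shared[0][1] = 1   # changes vector itself, so every "row" changes

print(copied)  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(shared)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(shared[0] is shared[2])  # True: the rows are one and the same list
```

A shallow copy is enough here because the elements are integers, which are immutable.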

Some of the other things you do in the function are a bit slow or roundabout. I would suggest a few changes:

def one_hot_encode(data_list, column):
    one_hot_list = [[]]

    # count different elements
    different_elements = set(row[column] for row in data_list[1:])

    # convert different_elements to a list with a canonical order,
    # store in the first element of one_hot_list
    one_hot_list[0] = sorted(different_elements)

    vector = [0] * len(different_elements)   # create list shape with zeroes
    # one independent copy of vector per data row (1460 rows in the question's file)
    one_hot_list.extend([vector[:] for _ in range(len(data_list) - 1)])

    # build a mapping of different_element values to indices into
    # one_hot_list[0]
    index_lookup = {e: i for i, e in enumerate(one_hot_list[0])}
    # encode 1 for each sample
    for rindex, row in enumerate(data_list[1:], 1):
        cindex = index_lookup[row[column]]
        one_hot_list[rindex][cindex] = 1

    return one_hot_list

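This builds different_elements in linear time using the set data type, then sorts it to produce one_hot_list[0] (the list of element values being one-hot encoded). The zero vector comes from list multiplication, and a list comprehension generates one_hot_list[1:] (the actual one-hot encoded matrix values). There is also a dict named index_lookup that lets you quickly map element values to their integer indices instead of searching for them over and over. Finally, enumerate manages the row index into the one_hot_list matrix for you.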
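To illustrate the lookup dict and the enumerate start offset in isolation, here is a small sketch. The samples list and the three category names are made up for the example; they stand in for one column of the real data:

```python
header = sorted({"Foo", "Bar", "Baz"})   # canonical order: ['Bar', 'Baz', 'Foo']
index_lookup = {value: i for i, value in enumerate(header)}

samples = ["Foo", "Bar", "Foo", "Baz"]   # hypothetical column values

# The dict gives the same positions as list.index, without rescanning
# the header list for every sample.
same_positions = all(index_lookup[s] == header.index(s) for s in samples)

# Row 0 holds the header, so enumerate starts counting data rows at 1.
matrix = [header] + [[0] * len(header) for _ in samples]
for rindex, value in enumerate(samples, 1):
    matrix[rindex][index_lookup[value]] = 1

print(same_positions)  # True
print(matrix[1:])      # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```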

Answer 2

I'm not 100% sure what you're trying to do, but the problem you're seeing is in these lines:

for i in range(1460):
    one_hot_list.append(vector)

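These lines build one_hot_list as 1460 references to the same vector of zeroes, whereas I think you want a new vector each time. The direct fix is simply to copy it each time: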

for i in range(1460):
    one_hot_list.append(vector[:])

But the more Pythonic way is to build the list with a list comprehension. Maybe something like this:

vector_size = len(different_elements)
one_hot_list = [[0] * vector_size for i in range(1460)]
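A short sketch of why the comprehension works, with hypothetical sizes (3 columns, 4 rows). The inner expression is evaluated once per iteration, so every row is a distinct list. By contrast, multiplying the outer list repeats a reference to a single inner list, which reproduces the original problem:

```python
vector_size = 3   # hypothetical number of distinct values
n_rows = 4        # hypothetical number of samples

# Comprehension: [0] * vector_size runs once per row, giving 4 separate lists.
rows = [[0] * vector_size for _ in range(n_rows)]
rows[0][2] = 1

# Outer list multiplication: one inner list, referenced 4 times.
aliased = [[0] * vector_size] * n_rows
aliased[0][2] = 1

distinct_rows = len({id(r) for r in rows})        # 4
distinct_aliased = len({id(r) for r in aliased})  # 1

print(rows)     # [[0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(aliased)  # [[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]]
```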

Answer 3

You can use set() to collect the unique items in a list:

different_elements = list(set(row[column] for row in data_list[1:]))
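One caveat worth sketching: set() needs hashable items, so it has to be fed the values of one column rather than whole rows (which are lists), and it does not preserve order. The example below uses a made-up data_list with a header row. It shows two ways to get a stable column order: sorting the set, or using dict.fromkeys, which keeps first-seen order in Python 3.7 and later:

```python
# Hypothetical data in the same shape as the question: header row first.
data_list = [
    ["Color", "Size"],
    ["red", "S"],
    ["blue", "M"],
    ["red", "L"],
]
column = 0

values = [row[column] for row in data_list[1:]]

unique_sorted = sorted(set(values))            # alphabetical order
unique_in_order = list(dict.fromkeys(values))  # order of first appearance

print(unique_sorted)    # ['blue', 'red']
print(unique_in_order)  # ['red', 'blue']
```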

Answer 4

I would suggest you avoid reimplementing this in pure Python. You can use pandas.get_dummies for this:

First, some test data (test.csv):

A
Foo
Bar
Baz

Then in Python:

import pandas as pd

df = pd.read_csv('test.csv')
# convert column 'A' to one-hot encoding
pd.get_dummies(df['A'])

You can retrieve the underlying numpy array with:

pd.get_dummies(df['A']).values

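The result (in pandas versions before 2.0; newer versions return booleans unless you pass a dtype such as dtype='uint8') is: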

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]], dtype=uint8)
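Since the question is about encoding only some columns of a larger file, here is a hedged sketch using the columns parameter of pandas.get_dummies. The second column B and the in-memory CSV text are invented for the example, and dtype=int is passed explicitly so the output does not depend on the pandas version's default:

```python
import io

import pandas as pd

# Hypothetical CSV content: column A is categorical, column B is numeric.
csv_text = "A,B\nFoo,1\nBar,2\nBaz,3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Encode only column 'A'; other columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["A"], dtype=int)

print(encoded.columns.tolist())  # ['B', 'A_Bar', 'A_Baz', 'A_Foo']
print(encoded[["A_Bar", "A_Baz", "A_Foo"]].values.tolist())
# [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

The new columns are named with the original column name as a prefix, and the dummy columns follow the sorted order of the category values.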

One-hot encoding, accessing list elements
