Python: One-hot encoding for huge data

Time: 2016-12-09 10:54:24

Tags: python one-hot-encoding

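I keep running into memory problems when trying to encode string labels as one-hot vectors. There are about 5 million rows and about 10,000 distinct labels. I have tried the following, but I keep getting memory errors: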

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_fitter = lb.fit(y)
y = label_fitter.transform(y)

I also tried something like this:

import numpy as np

def one_hot_encoding(y):
    unique_values = set(y)
    label_length = len(unique_values)
    enu_uniq = zip(unique_values , range(len(unique_values)))
    dict1 = dict(enu_uniq)
    values = []
    for i in y:
        temp = np.zeros((label_length,), dtype="float32")
        if i in dict1:
            temp[dict1[i]] = 1.0
        values.append(temp)
    return np.array(values)

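This also gives a memory error. Any tips? Some people have asked the same question on Stack Overflow, but none of the answers seem to help.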

2 answers:

Answer 0: (score: 3)

Your main problem seems to be that the binarized y doesn't fit into memory. You can work with sparse arrays to avoid this.
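To see why the dense result cannot fit, here is a rough back-of-the-envelope estimate using the approximate sizes from the question. The sparse figure assumes a CSC layout with uint8 values and 32-bit indices; the actual index width is chosen by scipy and may differ.

```python
# Rough memory estimate; pure arithmetic, nothing is allocated.
n_rows = 5_000_000    # approximate row count from the question
n_labels = 10_000     # approximate number of distinct labels

# Dense float32 matrix: one 4-byte value per (row, label) cell.
dense_bytes = n_rows * n_labels * 4
print(dense_bytes / 1e9, "GB for the dense matrix")

# Sparse CSC matrix: only the non-zero entries are stored (one per row).
# Assumption: uint8 values (1 byte) and 32-bit indices (4 bytes).
nnz = n_rows
sparse_bytes = nnz * 1 + nnz * 4 + (n_labels + 1) * 4
print(sparse_bytes / 1e6, "MB for the sparse matrix")
```

Under these assumptions the dense matrix needs about 200 GB, while the sparse one needs about 25 MB.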

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> y = np.random.randint(0, 10000, size=5000000) # 5M random integers [0,10K)

You can transform those labels y to a 5M x 10K sparse matrix as follows:

>>> dtype = np.uint8 # change to np.bool_ if you want booleans, or to another data type
>>> rows = np.arange(y.size) # each of the elements of `y` is a row itself
>>> cols = y # `y` indicates the column that is going to be flagged
>>> data = np.ones(y.size, dtype=dtype) # Set to `1` each (row,column) pair
>>> ynew = csc_matrix((data, (rows, cols)), shape=(y.size, y.max()+1), dtype=dtype)

ynew is then a sparse matrix where each row is full of zeros except one entry:

>>> ynew
<5000000x10000 sparse matrix of type '<type 'numpy.uint8'>'
     with 5000000 stored elements in Compressed Sparse Column format>

You will have to adapt your code to work with sparse matrices, but this is probably the best option you have. You can also recover full rows or columns from the sparse matrix as dense arrays:

>>> row0 = ynew[0].toarray() # row0 is a standard numpy array
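If downstream code needs dense input, one option is to densify only a small block of rows at a time. The following is a minimal sketch under that assumption; iter_dense_batches and batch_size are hypothetical names, not part of scipy, and the tiny array stands in for the real labels.

```python
import numpy as np
from scipy.sparse import csr_matrix

def iter_dense_batches(sparse_labels, batch_size):
    """Yield dense row blocks of a sparse matrix, one batch at a time."""
    csr = csr_matrix(sparse_labels)  # CSR format makes row slicing cheap
    for start in range(0, csr.shape[0], batch_size):
        # Only batch_size rows are ever held in dense form.
        yield csr[start:start + batch_size].toarray()

# Small stand-in for the 5M-row label vector.
y_small = np.array([0, 2, 1, 2, 0])
rows = np.arange(y_small.size)
data = np.ones(y_small.size, dtype=np.uint8)
onehot = csr_matrix((data, (rows, y_small)), shape=(y_small.size, 3))

batches = list(iter_dense_batches(onehot, batch_size=2))
```

Each yielded block is an ordinary numpy array, so memory use is bounded by batch_size times the number of labels.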

For string labels or labels of arbitrary data type:

>>> y = ['aaa' + str(i) for i in np.random.randint(0, 10000, size=5000000)] # e.g. 'aaa9937'

First extract a mapping from labels to integers:

>>> labels = np.unique(y) # List of unique labels
>>> mapping = {u:i for i,u in enumerate(labels)}
>>> inv_mapping = {i:u for i,u in enumerate(labels)} # Only needed if you want to recover original labels at some point

The mapping above assigns each label an integer: its index in the sorted array of unique labels.

And then create the sparse matrix again:

>>> N, M = len(y), labels.size
>>> dtype = np.uint8 # change to np.bool_ if you want booleans
>>> rows = np.arange(N)
>>> cols = [mapping[i] for i in y]
>>> data = np.ones(N, dtype=dtype)
>>> ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=dtype)

Optionally, you can build the inverse mapping if you later want to know which original label a given column index corresponds to:

>>> inv_mapping = {i:u for i,u in enumerate(labels)}
>>> inv_mapping[10] # ---> something like 'aaaXXX'
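As a possible shortcut for the dictionary-based mapping above, np.unique can return the integer code of every element directly through return_inverse=True, which avoids the Python-level loop over y. This is a sketch on a tiny made-up label array, not a benchmark:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny illustrative labels; the real y would have about 5M entries.
y_str = np.array(["cat", "dog", "cat", "bird"])

# labels: sorted unique values; codes: index into labels for each element of y_str
labels, codes = np.unique(y_str, return_inverse=True)

n = codes.size
onehot = csr_matrix(
    (np.ones(n, dtype=np.uint8), (np.arange(n), codes)),
    shape=(n, labels.size),
)

# labels[codes] recovers the original labels, so no inverse dict is needed.
recovered = labels[codes]
```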

Answer 1: (score: 2)

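This may not have been available when the question was asked, but LabelBinarizer takes a sparse_output argument. When it is set to True, transform returns a scipy sparse matrix instead of a dense array.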
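A minimal sketch of how this could look, assuming a scikit-learn version that supports the parameter; the label list is made up for illustration:

```python
from scipy.sparse import issparse
from sklearn.preprocessing import LabelBinarizer

# Tiny illustrative labels; the real y would have about 5M entries.
y = ["cat", "dog", "cat", "bird"]

lb = LabelBinarizer(sparse_output=True)
y_sparse = lb.fit_transform(y)  # scipy sparse matrix, one column per class

# The fitted binarizer can map the sparse matrix back to the labels.
recovered = lb.inverse_transform(y_sparse)
```

The columns follow the order of lb.classes_, so the original labels stay recoverable without a separate mapping.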