Question

我不确定这是否已经是最快的方法，或者我的效率是否低效。

我想对具有27k +可能级别的特定分类列进行热编码。该列在2个不同的数据集中具有不同的值，因此在使用get_dummies（）

之前我首先合并了这些级别

%s/\(a-z\)\zs : \1\ze//

然而，它已运行超过2个小时，它仍然是热编码。

我可以在这里做错事吗？或者只是在大型数据集上运行它的本质？

Df有6.8米行和27列，在热编码我想要的列之前，Df2有19990行和27列。

建议表示赞赏，谢谢！：）

Answer 1

我简要回顾了get_dummies source code，我认为它可能无法充分利用您的用例的稀疏性。以下方法可能更快，但我没有尝试将其一直扩展到您拥有的19M记录：

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N = 10000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N)
    , 'col2b': np.random.choice([1, 2, 3], N)
    , 'target': np.random.choice([1, 2, 3], N)
    })

# construct an array of the unique values of the column to be encoded
vals = np.array(dfa.col1.unique())
# extract an array of values to be encoded from the dataframe
col1 = dfa.col1.values
# construct a sparse matrix of the appropriate size and an appropriate,
# memory-efficient dtype
spmtx = ssp.dok_matrix((N, len(vals)), dtype=np.uint8)
# do the encoding. NB: This is only vectorized in one of the two dimensions.
# Finding a way to vectorize the second dimension may yield a large speed up
for idx, val in enumerate(vals):
    spmtx[np.argwhere(col1 == val), idx] = 1

# Construct a SparseDataFrame from the sparse matrix and apply the index
# from the original dataframe and column names.
dfnew = pd.SparseDataFrame(spmtx, index=dfa.index,
                           columns=['col1_' + str(el) for el in vals])
dfnew.fillna(0, inplace=True)

<强>更新

借鉴其他答案here和here的见解，我能够在两个维度中对解决方案进行矢量化。在我的有限测试中，我注意到构造SparseDataFrame似乎会将执行时间增加几倍。因此，如果您不需要返回类似DataFrame的对象，则可以节省大量时间。此解决方案还处理需要将2个以上DataFrame编码为具有相同列数的2-d数组的情况。

import numpy as np
import pandas as pd
import scipy.sparse as ssp

np.random.seed(1)
N1 = 10000
N2 = 100000

dfa = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N1)
    , 'col2a': np.random.choice([1, 2, 3], N1)
    , 'target': np.random.choice([1, 2, 3], N1)
    })

dfb = pd.DataFrame.from_dict({
    'col1': np.random.randint(0, 27000, N2)
    , 'col2b': np.random.choice(['foo', 'bar', 'baz'], N2)
    , 'target': np.random.choice([1, 2, 3], N2)
    })

# construct an array of the unique values of the column to be encoded
# taking the union of the values from both dataframes.
valsa = set(dfa.col1.unique())
valsb = set(dfb.col1.unique())
vals = np.array(list(valsa.union(valsb)), dtype=np.uint16)


def sparse_ohe(df, col, vals):
    """One-hot encoder using a sparse ndarray."""
    colaray = df[col].values
    # construct a sparse matrix of the appropriate size and an appropriate,
    # memory-efficient dtype
    spmtx = ssp.dok_matrix((df.shape[0], vals.shape[0]), dtype=np.uint8)
    # do the encoding
    spmtx[np.where(colaray.reshape(-1, 1) == vals.reshape(1, -1))] = 1

    # Construct a SparseDataFrame from the sparse matrix
    dfnew = pd.SparseDataFrame(spmtx, dtype=np.uint8, index=df.index,
                               columns=[col + '_' + str(el) for el in vals])
    dfnew.fillna(0, inplace=True)
    return dfnew

dfanew = sparse_ohe(dfa, 'col1', vals)
dfbnew = sparse_ohe(dfb, 'col1', vals)

pd.get_dummies（）在较大的级别上变慢

1 个答案: