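Python - Speeding up conversion of categorical variables to numeric indices

Date: 2016-06-07 07:10:19

Tags: python performance numpy pandas dataframe

I need to convert a column of categorical variables in a Pandas data frame into numeric values that correspond to each value's index in the array of unique categorical variables in that column (long story!). Here is a code snippet that does this: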

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

This converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

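as desired. My problem is that when I run similar code on large data files, my dumb little for loop (the only way I could think of to do this) is as slow as molasses. I'm curious whether anyone has ideas for doing this more efficiently. Thanks in advance for any thoughts.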

2 Answers:

Answer 0 (score: 5)

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1

Docs
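As a small sketch of what factorize hands back (the sample values, including the missing entry, are made up for illustration): it returns both the integer codes and the unique labels, numbers labels in order of first appearance unless sort=True is passed, and by default marks missing values with -1.

```python
import pandas as pd

# Illustrative data only: "beans" appears first and one value is missing.
df = pd.DataFrame({"col": ["beans", "baked", "beans", None, "baked"]})

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(df["col"])

# sort=True numbers the labels in sorted order instead.
sorted_codes, sorted_uniques = pd.factorize(df["col"], sort=True)

print(codes.tolist())         # [0, 1, 0, -1, 1]
print(list(uniques))          # ['beans', 'baked']
print(sorted_codes.tolist())  # [1, 0, 1, -1, 0]
print(list(sorted_uniques))   # ['baked', 'beans']
```

If the codes must match the sorted unique array, as in the question's np.unique approach, sort=True is the variant to reach for.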

Edit:

As Jeff mentioned in the comments, it is better to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
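A rough sketch of how the categorical route still yields the numeric index, and of the memory difference (the 5,000-row frame is an assumption chosen only to make the difference visible): the integer codes are available through the .cat.codes accessor, and memory_usage(deep=True) lets the two representations be compared.

```python
import pandas as pd

# Illustrative frame: the question's five labels repeated 1,000 times.
df = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"] * 1000})

as_strings = df["col"]
as_category = df["col"].astype("category")

# Integer position of each row's label within the sorted categories.
codes = as_category.cat.codes

print(list(as_category.cat.categories))  # ['baked', 'beans']
print(codes.iloc[:5].tolist())           # [0, 1, 0, 0, 1]

# deep=True also counts the string data itself, not just the references.
print(as_strings.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
```

The exact byte counts depend on the pandas version and string storage, so none are quoted here.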

Timings

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop

len(df)=5k

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

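Code for testing:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
# repeat the 5-row frame 100,000 times for the 500k test
df = pd.concat([df]*100000).reset_index(drop=True)
# for the 5k test, use this line instead:
#df = pd.concat([df]*1000).reset_index(drop=True)

# one independent copy per function, since each one overwrites 'col'
df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()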

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))  
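The %timeit lines need IPython. Outside of it, a comparison along the same lines could be sketched with the standard-library timeit module. The helper names with_factorize and with_unique are hypothetical, and unlike a and b they leave the frame untouched, so every timed call works on the original strings instead of on integer codes written back by an earlier call.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"col": ["baked", "beans", "baked", "baked", "beans"]})
df = pd.concat([base] * 1000).reset_index(drop=True)  # the 5k-row case

def with_factorize(col):
    # pandas approach: codes in order of first appearance
    return pd.factorize(col)[0]

def with_unique(col):
    # numpy approach: codes index into the sorted unique array
    return np.unique(col, return_inverse=True)[1]

# On this data the two agree: "baked" appears first and also sorts first.
same = np.array_equal(with_factorize(df["col"]), with_unique(df["col"]))
print("same codes:", same)

timings = {}
for name, func in [("factorize", with_factorize), ("np.unique", with_unique)]:
    # Best of 3 repeats of 100 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(df["col"]), number=100, repeat=3))
    timings[name] = best / 100
    print(f"{name}: {timings[name] * 1e6:.0f} µs per call")
```

Absolute numbers will differ by machine and library version, so no expected timings are given.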

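Answer 1 (score: 3)

You can use np.unique with its optional argument return_inverse to give each string an ID based on its position among the unique strings, and then set those IDs in the input data frame, like so -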

_,idx = np.unique(df['col'],return_inverse=True)
df['col'] = idx

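Note that the IDs correspond to the alphabetically sorted array of unique strings. If you need that unique array as well, you can capture it in place of _, like so -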

uniq_lab,idx = np.unique(df['col'],return_inverse=True)
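To make the sorted-order point concrete, here is a minimal sketch on made-up data where the first label seen is not the first alphabetically. It also shows that indexing the unique array with the IDs recovers the original strings, and casts the IDs to float on the assumption that the float output shown in the question is wanted.

```python
import numpy as np
import pandas as pd

# Illustrative data only: "beans" appears first but sorts after "baked".
df = pd.DataFrame({"col": ["beans", "baked", "beans", "beans", "baked"]})

uniq_lab, idx = np.unique(df["col"], return_inverse=True)

# uniq_lab is sorted, so "baked" gets ID 0 even though "beans" comes first.
print(uniq_lab.tolist())  # ['baked', 'beans']
print(idx.tolist())       # [1, 0, 1, 1, 0]

# Indexing the unique array with the IDs reproduces the original column.
restored = uniq_lab[idx]

# The question's output used floats; cast explicitly if that matters.
df["col"] = idx.astype(float)
print(df["col"].tolist())  # [1.0, 0.0, 1.0, 1.0, 0.0]
```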

Sample run -

>>> d = {'col': ["baked","beans","baked","baked","beans"]}
>>> df = pd.DataFrame(data=d)
>>> df
     col
0  baked
1  beans
2  baked
3  baked
4  beans
>>> _,idx = np.unique(df['col'],return_inverse=True)
>>> df['col'] = idx
>>> df
   col
0    0
1    1
2    0
3    0
4    1