Binary shift on a pandas Series

Asked: 2019-04-02 13:05:20

Tags: python pandas

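I have some boolean variables in a pandas DataFrame, and I need to get all the unique tuples. My idea is to create a new column that concatenates the variable values, and then call unique() on that column (pandas.Series.unique()) to get all the unique tuples.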

To do the concatenation, I want to use a binary expansion. For example, for the DataFrame:

import pandas as pd
df = pd.DataFrame({'v1':[0,1,0,0,1],'v2':[0,0,0,1,1], 'v3':[0,1,1,0,1], 'v4':[0,1,1,1,1]})

I can create the column like this:

df['added'] = df['v1'] + df['v2']*2 + df['v3']*4 + df['v4']*8

My idea is to loop over the list of variables like this (note that in my real problem I do not know the number of columns in advance):

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
   df['added'] = df['added'] + df[var] << ind

This raises an error: "TypeError: unsupported operand type(s) for <<: 'Series' and 'int'".
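A note on the loop above: in Python, `+` binds more tightly than `<<`, so `df['added'] + df[var] << ind` is parsed as `(df['added'] + df[var]) << ind`, and the error message shows that `<<` was not available between a Series and an int in the pandas version used. The following is a minimal sketch of one vectorized way around both points, assuming the goal is to reproduce the hand-written column (`v1 + v2*2 + v3*4 + v4*8`). It uses `numpy.left_shift` on the underlying array and starts the shift count at 1; the local name `added` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']

# Start from the first column (shift of 0), then add each later column
# shifted by its position. start=1 makes v2 shift by 1, v3 by 2, v4 by 3.
added = df['v1'].copy()
for ind, var in enumerate(variables[1:], start=1):
    # np.left_shift works element-wise on the integer array of the column.
    added = added + np.left_shift(df[var].to_numpy(), ind)

df['added'] = added
```

With this version the loop works for any number of columns listed in `variables`, as long as the resulting integers fit in 64 bits.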

I can work around the problem with pandas.Series.apply() like this:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
   df['added'] = df['added'] + df[var].apply(lambda x : x << ind )

However, apply is (generally) slow. How can I do this more efficiently?

Thanks in advance

M

3 Answers:

Answer 0: (score: 1)

Getting the unique rows is the same operation as drop_duplicates. (Find all the duplicated rows and remove them, and only the unique rows are left.)

df[["v2","v3","v4"]].drop_duplicates()
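As a small sketch of how this could look when the number of columns is not known in advance, the example below applies `drop_duplicates` to every variable column and then converts the remaining rows to plain tuples. The extra duplicated row and the names `variables`, `unique_rows` and `unique_tuples` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
# Append a copy of row 1 so that there is actually a duplicate to drop.
df = pd.concat([df, df.iloc[[1]]], ignore_index=True)

# Take whatever columns are present instead of hard-coding their names.
variables = list(df.columns)

# Keep the first occurrence of each distinct combination of values.
unique_rows = df[variables].drop_duplicates()

# Convert each remaining row to a plain tuple.
unique_tuples = list(unique_rows.itertuples(index=False, name=None))
```

This avoids building an intermediate encoded column altogether, which may be enough if only the unique combinations are needed.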

Answer 1: (score: 1)

Just use this solution, simplified here because the ordering is already swapped:

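import numpy as np

# Weight column k by 2**k (1 << k) and sum across each row with one dot product.
df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))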
print (df)
   v1  v2  v3  v4  new
0   0   0   0   0    0
1   1   0   1   1   13
2   0   0   1   1   12
3   0   1   0   1   10
4   1   1   1   1   15
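To connect the encoded column back to the original goal of finding unique tuples, here is a hedged sketch that computes the codes, takes their unique values, and decodes them into bit tuples again. The names `codes`, `unique_codes` and `decoded` are illustrative assumptions, and the approach presumes the number of columns is small enough for the codes to fit in a 64-bit integer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [0, 1, 0, 0, 1], 'v2': [0, 0, 0, 1, 1],
                   'v3': [0, 1, 1, 0, 1], 'v4': [0, 1, 1, 1, 1]})
variables = ['v1', 'v2', 'v3', 'v4']
n = len(variables)

# Encode: column k contributes 2**k, so each row becomes one integer.
codes = df[variables].to_numpy().dot(1 << np.arange(n))

# Unique integers correspond one-to-one to unique rows.
unique_codes = np.unique(codes)

# Decode: shift each code right by k and mask the lowest bit to get column k.
decoded = (unique_codes[:, None] >> np.arange(n)) & 1
```

Selecting the columns through `variables` rather than `df.values` keeps any previously added helper column out of the encoding.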

Performance with 1000 rows and 4 columns:

np.random.seed(2019)

N= 1000
df = pd.DataFrame(np.random.choice([0,1], size=(N, 4)))
df.columns = [f'v{x+1}' for x in df.columns]

In [60]: %%timeit
    ...: df['new'] = df.values.dot(1 << np.arange(df.shape[-1]))
113 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Yuca's solution:

In [65]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:     df['added'] = df['added'] + [x<<ind for x in df[var]]
    ...: 
1.82 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Original solution:

In [66]: %%timeit
    ...: variables = ['v1', 'v2', 'v3', 'v4']
    ...: df['added'] = df['v1']
    ...: for ind, var in enumerate(variables[1:]) :
    ...:    df['added'] = df['added'] + df[var].apply(lambda x : x << ind )
    ...: 
3.14 ms ± 8.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answer 2: (score: 0)

To answer your question about a more efficient alternative, I found that a list comprehension does help:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
    %timeit df['added'] = df['added'] + [x<<ind for x in df[var]]

308 µs ± 22.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
322 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
316 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So about 315 µs per iteration, versus:

variables = ['v1', 'v2', 'v3', 'v4']
df['added'] = df['v1']
for ind, var in enumerate(variables[1:]) :
    %timeit df['added'] = df['added'] + df[var].apply(lambda x : x << ind )

500 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
503 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
481 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As a disclaimer, I do not agree with the values of the sum, but that is a different topic :)
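One possible reading of this disclaimer (an interpretation, not something the answer states) is that `enumerate(variables[1:])` starts counting at 0, so `v2` is shifted by 0 and the loop computes `v1 + v2 + 2*v3 + 4*v4` rather than `v1 + 2*v2 + 4*v3 + 8*v4`. The sketch below uses a made-up two-row frame named `demo` to show that the loop's weights can map different rows to the same value; multiplication by `2**ind` stands in for `<< ind`, which is equivalent for non-negative integers.

```python
import pandas as pd

# Two different rows chosen for illustration: (1, 0, 0, 0) and (0, 1, 0, 0).
demo = pd.DataFrame({'v1': [1, 0], 'v2': [0, 1], 'v3': [0, 0], 'v4': [0, 0]})
variables = ['v1', 'v2', 'v3', 'v4']

# Same weights as the loop in the question: v2 gets 2**0, v3 2**1, v4 2**2.
loop_code = demo['v1'].copy()
for ind, var in enumerate(variables[1:]):
    loop_code = loop_code + demo[var] * 2**ind

# Weights from the hand-written formula: column k gets 2**k.
weighted = sum(demo[var] * 2**k for k, var in enumerate(variables))

print(loop_code.tolist())  # both rows collapse to the same code
print(weighted.tolist())   # the two rows stay distinct
```

If this reading is right, passing `start=1` to `enumerate` would be one way to make the loop agree with the hand-written formula.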