Fastest way to set values for group slices in Pandas

Time: 2015-12-17 00:54:27

Tags: python pandas multi-index

Is there a faster, more efficient way to do what the last two lines do? Perhaps using where?

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])

for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]

2 answers:

Answer 0: (score: 1)

Edit: multiplied the arrays by 100000 to simulate large data, and added a timing comparison

Your data:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']*100000,
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']*100000]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8*100000,2), index=index, columns=['A', 'B'])

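From there, I assume I know nothing about the data in the DataFrame except that it has a 'second' index level, and that you want to generate random.randn values for column 'A' that depend on the 'second' index.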

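# Position of the 'second' level within the MultiIndex
second_index = df.index.names.index('second')
# Integer code of each row's 'second' label
# (MultiIndex.codes was named MultiIndex.labels before pandas 0.24)
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
# One random draw per distinct 'second' label
rands = np.random.randn(no_second_labels)

# Spread the per-label draws over all rows by integer indexing
df.A = rands[second_labels]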
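
To make the indexing step above concrete, here is a minimal sketch on the small eight-row frame from the question. It assumes pandas 0.24 or later, where the integer positions are exposed as `MultiIndex.codes`; the names `small`, `codes` and `per_group` are illustrative only and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small frame from the question: four 'first' labels x two 'second' labels.
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
small = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

level_pos = small.index.names.index('second')
# One integer code per row, pointing into the level's unique values.
codes = np.asarray(small.index.codes[level_pos])
n_groups = len(small.index.levels[level_pos])

# One random draw per unique 'second' value, expanded to rows by indexing.
rands = np.random.randn(n_groups)
small['A'] = rands[codes]

# Count the distinct values of A inside each 'second' group.
per_group = small.groupby(level='second')['A'].nunique()
print(codes)
print(per_group)
```

Because `rands[codes]` builds the whole column in one NumPy operation, no Python-level loop over groups or rows is needed.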

#My solution
%%timeit
second_index = df.index.names.index('second')
second_labels = df.index.codes[second_index]  # named .labels before pandas 0.24
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
#100 loops, best of 3: 11.1 ms per loop

#Alexander's solution
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .codes was .labels before pandas 0.24
#1 loops, best of 3: 188 ms per loop

Answer 1: (score: 1)

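You can first create a set of random numbers, one for each item in the second level of the index (df.index.levels[1]). You can then use a list comprehension to loop over each row's label for that level and map it to its random number.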

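np.random.seed(0)
# One random number per position in the second index level
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
# Look up each row's level code (MultiIndex.codes; named labels before pandas 0.24)
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]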

>>> df
                     A         B
first second                    
bar   one     1.764052  0.144044
      two     0.400157  0.761038
baz   one     1.764052  0.443863
      two     0.400157  1.494079
foo   one     1.764052  0.313068
      two     0.400157 -2.552990
qux   one     1.764052  0.864436
      two     0.400157  2.269755

%%timeit
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
1000 loops, best of 3: 1.99 ms per loop

%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .codes was .labels before pandas 0.24
10000 loops, best of 3: 120 µs per loop
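
The `%%timeit` cells above only run inside IPython. For a plain script, a comparable measurement could be sketched with the standard-library `timeit` module, as below. The function names `groupby_loop` and `code_lookup` are hypothetical wrappers around the approaches discussed on this page, pandas 0.24 or later is assumed for `MultiIndex.codes`, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
df_t = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

def groupby_loop(frame):
    # The question's approach: one .loc assignment per 'second' group.
    for _, group in frame.groupby(level='second'):
        frame.loc[group.index, 'A'] = np.random.randn()

def code_lookup(frame):
    # Index an array of per-label draws with the level's integer codes.
    pos = frame.index.names.index('second')
    codes = np.asarray(frame.index.codes[pos])
    frame['A'] = np.random.randn(len(frame.index.levels[pos]))[codes]

n_calls = 100
for func in (groupby_loop, code_lookup):
    seconds = timeit.timeit(lambda: func(df_t), number=n_calls)
    print(f"{func.__name__}: {seconds / n_calls * 1e6:.1f} µs per call")
```

Increase the frame size (for example by repeating the index lists, as in Answer 0) to see how each approach scales.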