Is there a faster, more efficient way to do what the last two lines do? Perhaps using where?
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8,2), index=index, columns=['A', 'B'])
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
Answer 0 (score: 1)
Your data:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']*100000,
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']*100000]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8*100000,2), index=index, columns=['A', 'B'])
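From here on, I assume I know nothing about the data in the DataFrame except that it has a 'second' index level, and that you want to fill column 'A' with np.random.randn values that depend on that 'second' level.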
second_index = df.index.names.index('second')
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24 and removed in 1.0
second_labels = df.index.codes[second_index]
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
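The idea above (draw one random number per distinct 'second' value, then spread the draws over the rows by indexing with the integer codes) can be sketched as a small self-contained function. The function name `assign_per_level_random`, its parameters, and the use of `np.random.default_rng` are assumptions made for this illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def assign_per_level_random(df, level, column, rng=None):
    """Return a copy of df where rows sharing a `level` value share one random draw.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    level_pos = df.index.names.index(level)
    # Integer position of each row's value within df.index.levels[level_pos].
    # Note: a code of -1 marks a missing value and would silently pick the last draw.
    codes = np.asarray(df.index.codes[level_pos])
    n_unique = len(df.index.levels[level_pos])
    draws = rng.standard_normal(n_unique)  # one draw per distinct level value
    out = df.copy()
    out[column] = draws[codes]  # NumPy integer-array indexing does the broadcast
    return out


index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])
df_small = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

result = assign_per_level_random(df_small, 'second', 'A', np.random.default_rng(0))
# Every 'second' group should now hold exactly one distinct value in column 'A'.
per_group_unique = result.groupby(level='second')['A'].nunique()
print(per_group_unique)
```

Returning a copy rather than mutating the input is a design choice of this sketch; the original answer assigns to `df.A` in place.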
#My solution
%%timeit
second_index = df.index.names.index('second')
second_labels = df.index.codes[second_index]  # .codes was called .labels before pandas 0.24
no_second_labels = len(df.index.levels[second_index])
rands = np.random.randn(no_second_labels)
df.A = rands[second_labels]
#100 loops, best of 3: 11.1 ms per loop
#Alexander's solution
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .codes was called .labels before pandas 0.24
#1 loops, best of 3: 188 ms per loop
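The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard-library `timeit` module. The function names `make_frame`, `fancy_indexing`, and `dict_lookup` are made up for this example, and the frame is deliberately smaller than the 100000 repeats used above, so the absolute numbers will not match the ones quoted.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(repeat):
    """Build the example frame with the 8-row pattern repeated `repeat` times."""
    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'] * repeat,
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'] * repeat]
    index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
    return pd.DataFrame(np.random.randn(8 * repeat, 2), index=index, columns=['A', 'B'])


def fancy_indexing(df):
    """Vectorised approach: index the array of draws with the level codes."""
    codes = df.index.codes[1]
    rands = np.random.randn(len(df.index.levels[1]))
    df['A'] = rands[codes]


def dict_lookup(df):
    """Dictionary plus list comprehension approach."""
    randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
    df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]


frame = make_frame(1000)  # assumption: 1000 repeats keeps the script quick
timings = {
    fn.__name__: min(timeit.repeat(lambda: fn(frame), number=5, repeat=3)) / 5
    for fn in (fancy_indexing, dict_lookup)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```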
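Answer 1 (score: 1)
You can first create a set of random numbers, one for each item in the second level of the index (df.index.levels[1]). You can then use a list comprehension to loop over every label of that level and map it to its random number.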
np.random.seed(0)
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24 and removed in 1.0
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]
>>> df
                     A         B
first second
bar   one     1.764052  0.144044
      two     0.400157  0.761038
baz   one     1.764052  0.443863
      two     0.400157  1.494079
foo   one     1.764052  0.313068
      two     0.400157 -2.552990
qux   one     1.764052  0.864436
      two     0.400157  2.269755
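A possible variation on the mapping step, shown here only as a sketch: instead of keying the dictionary by integer position, key it by the level value itself and let `Index.map` do the lookup. The names `df_demo`, `second_values`, and `randoms_by_name` are assumptions for this example and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second'])
df_demo = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

# One value of the 'second' level per row, e.g. 'one', 'two', 'one', ...
second_values = df_demo.index.get_level_values('second')

# One random draw per distinct level value, keyed by the value itself.
randoms_by_name = {name: np.random.randn() for name in second_values.unique()}

# Index.map accepts a dict; to_numpy() makes the assignment purely positional.
df_demo['A'] = second_values.map(randoms_by_name).to_numpy()
print(df_demo)
```

Keying by name makes the dictionary easier to inspect, at the cost of a lookup by string rather than by integer code.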
%%timeit
for second, group in df.groupby(level='second'):
    df.loc[group.index, 'A'] = np.random.randn(1)[0]
1000 loops, best of 3: 1.99 ms per loop
%%timeit
randoms = {n: np.random.randn(1)[0] for n, _ in enumerate(df.index.levels[1])}
df['A'] = [randoms[n] for n in df.index.codes[1].tolist()]  # .codes was called .labels before pandas 0.24
10000 loops, best of 3: 120 µs per loop