Multiply every three rows of a dataframe by a different value

Time: 2016-06-30 04:48:38

Tags: python pandas numpy dataframe

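I have a dataframe with 9 rows. I want to multiply the first three rows by one value, the second three rows by a second value, and the third three rows by another value.

I use these variables: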

import pandas as pd

df = pd.DataFrame([[i] * 5 for i in range(9)], columns=list('ABCDE'))

a = pd.Series(range(3))

print(df)

   A  B  C  D  E
0  0  0  0  0  0
1  1  1  1  1  1
2  2  2  2  2  2
3  3  3  3  3  3
4  4  4  4  4  4
5  5  5  5  5  5
6  6  6  6  6  6
7  7  7  7  7  7
8  8  8  8  8  8

I was able to get it working like this:

for i, e in a.items():
    start, end = i * len(a), (i + 1) * len(a)
    df.iloc[start:end] *= e

print(df)

    A   B   C   D   E
0   0   0   0   0   0
1   0   0   0   0   0
2   0   0   0   0   0
3   3   3   3   3   3
4   4   4   4   4   4
5   5   5   5   5   5
6  12  12  12  12  12
7  14  14  14  14  14
8  16  16  16  16  16

2 answers:

Answer 0: (score: 3)

You can use a numpy reshape:

import numpy as np

df.loc[:, :] = (df.values.reshape(3, df.size // 3) * np.arange(3)[:, None]).reshape(df.shape)
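To make the reshape step easier to follow, here is a minimal self-contained sketch of the same idea. The names `factors`, `blocks`, `scaled` and `result` are introduced here for illustration only, and the sketch assumes the number of rows divides evenly by the number of multipliers. It uses integer division (`//`) so that `reshape` receives an integer under Python 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[i] * 5 for i in range(9)], columns=list('ABCDE'))
factors = np.arange(3)  # illustrative multipliers, one per block of rows

# Flatten each block of consecutive rows into a single row, so that
# row k of `blocks` holds every value that belongs to block k.
blocks = df.values.reshape(len(factors), df.size // len(factors))

# factors[:, None] has shape (3, 1), so it broadcasts across each row of
# `blocks`; the final reshape restores the original (9, 5) layout.
scaled = (blocks * factors[:, None]).reshape(df.shape)

result = pd.DataFrame(scaled, index=df.index, columns=df.columns)
print(result)
```

Building a new frame instead of assigning into `df.loc[:, :]` leaves the original `df` untouched, which is convenient when comparing approaches.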

Timing: (the original timing image is not available)

Answer 1: (score: 2)

Another solution: multiply the DataFrame with mul by a numpy array expanded with numpy.repeat:

print (df.mul(np.repeat(a.index.values, [3] * len(a)), axis=0))
    A   B   C   D   E
0   0   0   0   0   0
1   0   0   0   0   0
2   0   0   0   0   0
3   3   3   3   3   3
4   4   4   4   4   4
5   5   5   5   5   5
6  12  12  12  12  12
7  14  14  14  14  14
8  16  16  16  16  16
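Note that `a.index.values` equals the values of `a` here only because `a` is `range(3)`. As a hedged sketch for arbitrary multipliers, the variant below repeats `a.values` instead. The multipliers `10, 20, 30` and the names `block` and `row_factors` are made up for illustration, and the sketch assumes `len(df)` is an exact multiple of `len(a)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[i] * 5 for i in range(9)], columns=list('ABCDE'))
a = pd.Series([10, 20, 30])  # hypothetical multipliers, one per block

block = len(df) // len(a)    # rows per block (3 here)

# Repeat each multiplier `block` times: one factor per row of df.
row_factors = np.repeat(a.values, block)

# axis=0 lines the 1-D array up with the rows rather than the columns.
result = df.mul(row_factors, axis=0)
print(result)
```

Passing a plain integer as the repeat count is equivalent to `[3] * len(a)` when every block has the same size.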

Timings (len(df)=9):

In [20]: %timeit (df.mul(np.repeat(a.index.values, [3] * len(a)), axis=0))
The slowest run took 6.12 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 197 µs per loop

In [21]: %%timeit 
    ...: df.loc[:, :] = (df.values.reshape(3, df.size / 3) * np.arange(3)[:, None]).reshape(df.shape)

__main__:257: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
The slowest run took 6.16 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 199 µs per loop

Code for timings (len(df)=90k):

df = pd.DataFrame([[i] * 5 for i in range(9)], columns=list('ABCDE'))
df = pd.concat([df]*10000).reset_index(drop=True)
a = pd.Series(range(30000))  # one multiplier per block of three rows (90000 / 3)
print (df)

Timings (len(df)=90k):

In [24]: %timeit (df.mul(np.repeat(a.index.values, [3] * len(a)), axis=0))
100 loops, best of 3: 3.58 ms per loop

In [33]: %%timeit
    ...: df.loc[:, :] = (df.values.reshape(3, df.size / 3) * np.arange(3)[:, None]).reshape(df.shape)
    ...: 
__main__:257: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
100 loops, best of 3: 10.9 ms per loop

In [34]: %%timeit
    ...: df.iloc[:, :] = (df.values.reshape(3, df.size / 3) * np.arange(3)[:, None]).reshape(df.shape)
    ...: 
__main__:257: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
100 loops, best of 3: 10.9 ms per loop