Question

I have a large real-valued 1-D data set named r. I would like to plot:

mean(log(1+a*r)) vs a, with a > -1.

Here is my code:

   rr=pd.read_csv('goog.csv')
   dd=rr['Close']
   series=pd.Series(dd)
   seriespct=series.pct_change()
   seriespct[0]=seriespct.mean()

   dum1 =[0]*len(dd)

   a=1.
   a_max = 1.
   a_step = 0.01

   a = scipy.arange(-3.+a_step, a_max, a_step)
   n = len(a)
   dum2 =[0]*n
   m=len(dd)

   for j in range(n):
      for i in range(m):
         dum1[i]=math.log(1+a[j]*seriespct[i])
      dum2[j]=scipy.mean(dum1)


   plt.plot(a,dum2)
   plt.show()

How can I do this in a more elegant way?

Answer 1

I'd suggest this:

plt.plot(a, np.log(1 + r*a[:,None]).mean(1))

This has a large speed advantage because it avoids the Python for loop; with a large data set, the looping done inside numpy is much faster.

In [49]: a = np.arange(a_step-.3, a_max, a_step)

In [50]: r = np.random.random(100)

In [51]: timeit [scipy.mean(log(1+a[i]*r)) for i in range(len(a))]
100 loops, best of 3: 5.47 ms per loop

In [52]: timeit np.log(1 + r*a[:,None]).mean(1)
1000 loops, best of 3: 384 µs per loop

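This works by broadcasting: a varies along one axis and r along the other. You then take the mean along the axis where r varies, so you are left with an array that varies with a (and has the same shape as a):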

import numpy as np
import matplotlib.pyplot as plt

r = np.random.random(100)

a = 1.
a_max = 1.
a_step = 0.01
a = np.arange(a_step-.3, a_max, a_step)
a.shape
#(129,)
a = a[:,None] #adds a new axis, making this a column vector, same as: a = a.reshape(-1,1)
a.shape
#(129, 1)
(a*r).shape
#(129, 100)
loga = np.log(1 + a*r)
loga.shape
#(129,100)
mloga = loga.mean(axis=1) #take the mean along the 2nd axis, where `r` varies
mloga.shape
#(129,)

plt.plot(a, mloga)
plt.show()
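
To connect this back to the question's data, here is a minimal sketch that wraps the broadcast expression in a function and feeds it percentage changes from a pandas Series. The helper name `mean_log_growth`, the made-up prices, and the choice to drop the leading NaN (the question fills it with the series mean instead) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def mean_log_growth(a, r):
    """mean(log(1 + a*r)) over r, one value per entry of a (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    r = np.asarray(r, dtype=float)
    # a[:, None] has shape (len(a), 1) and r has shape (len(r),), so the
    # product broadcasts to (len(a), len(r)); average over the r axis.
    return np.log(1 + a[:, None] * r).mean(axis=1)

# Stand-in for rr['Close'] from the question; these prices are made up.
close = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])

# pct_change() leaves NaN in the first slot; dropping it is an assumption
# made here, not what the question's code does.
r = close.pct_change().dropna().to_numpy()

a = np.linspace(-0.99, 1.0, 200)
curve = mean_log_growth(a, r)   # shape (200,), ready for plt.plot(a, curve)
```

Because `a` stays one-dimensional outside the function, `curve` and `a` have matching shapes and can be passed straight to `plt.plot`.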

Addendum:

To avoid relying on broadcasting, you can use np.outer:

plt.plot(a, np.log(1 + np.outer(a,r)).mean(1))

This works without reshaping a (skip the step a = a[:,None]).

Here is a simpler example so you can see what is going on:

r = np.exp(np.arange(1,5))
a = np.arange(5)

In [33]: r
Out[33]: array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003])

In [34]: a
Out[34]: array([0, 1, 2, 3, 4])

In [39]: r*a[:,None]
Out[39]: 
# this is  2.7...         7.3...        20.08...       54.5...         # times:
array([[   0.        ,    0.        ,    0.        ,    0.        ],   # 0
       [   2.71828183,    7.3890561 ,   20.08553692,   54.59815003],   # 1
       [   5.43656366,   14.7781122 ,   40.17107385,  109.19630007],   # 2
       [   8.15484549,   22.1671683 ,   60.25661077,  163.7944501 ],   # 3
       [  10.87312731,   29.5562244 ,   80.34214769,  218.39260013]])  # 4

In [40]: np.outer(a,r)
Out[40]: 
array([[   0.        ,    0.        ,    0.        ,    0.        ],
       [   2.71828183,    7.3890561 ,   20.08553692,   54.59815003],
       [   5.43656366,   14.7781122 ,   40.17107385,  109.19630007],
       [   8.15484549,   22.1671683 ,   60.25661077,  163.7944501 ],
       [  10.87312731,   29.5562244 ,   80.34214769,  218.39260013]])

# this is the mean of each row (one value per entry of a):
In [41]: (np.outer(a,r)).mean(1)
Out[41]: array([  0.        ,  21.19775622,  42.39551244,  63.59326866,  84.79102488])

# and the log of 1 + the above is:
In [42]: np.log(1+(np.outer(a,r)).mean(1))
Out[42]: array([ 0.        ,  3.09999121,  3.77035604,  4.16811021,  4.4519144 ])
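
One caveat on the last step of the session above: it takes the log of the row means, whereas the quantity asked about in the question is the mean of the logs. The two generally differ. A small sketch on the same toy arrays, assuming nothing beyond NumPy:

```python
import numpy as np

r = np.exp(np.arange(1, 5))
a = np.arange(5)
products = np.outer(a, r)                    # shape (5, 4)

log_of_mean = np.log(1 + products.mean(1))   # what Out[42] shows
mean_of_log = np.log(1 + products).mean(1)   # mean(log(1 + a*r)), as asked

# log is concave, so the mean of the logs never exceeds the log of the mean
# (Jensen's inequality); the gap is zero only when a row is constant.
gap = log_of_mean - mean_of_log
```

For plotting mean(log(1+a*r)) against a, `mean_of_log` is the array to use.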

Answer 2

You can use scipy for the means.

You can use matplotlib for the plotting.

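import numpy as np
from matplotlib import pyplot

# The original called scipy.array, scipy.arange and scipy.mean; those were
# aliases of the NumPy functions and are deprecated in newer SciPy releases,
# so NumPy is used directly here.

# convert r from a Python list to a 1-D array
r = np.array(r)

# edit these
a_max = 100
a_step = 0.1

a = np.arange(-1 + a_step, a_max, a_step)
n = len(a)

# log was undefined in the original; np.log is used here
pyplot.plot(a, [np.mean(np.log(1 + a[i]*r)) for i in range(n)], 'b-')
pyplot.show()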
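
A range of a this wide only works if 1 + a*r stays positive for every value in r; otherwise the log produces NaN. With 0 <= r <= 1, the question's condition a > -1 is sufficient, but data containing negative values also bounds a from above. The sketch below computes the admissible interval; the helper name `admissible_a_range` and the sample values are assumptions for illustration.

```python
import numpy as np

def admissible_a_range(r):
    """Open interval (lo, hi) of a with 1 + a*r > 0 for every r (hypothetical helper)."""
    r = np.asarray(r, dtype=float)
    r_min, r_max = r.min(), r.max()
    lo = -1.0 / r_max if r_max > 0 else -np.inf   # positive r bounds a from below
    hi = -1.0 / r_min if r_min < 0 else np.inf    # negative r bounds a from above
    return lo, hi

r = np.array([0.02, -0.01, 0.04, -0.05])   # made-up sample
lo, hi = admissible_a_range(r)              # approximately (-25, 20)

a = np.linspace(lo, hi, 50)[1:-1]           # stay strictly inside the interval
values = np.log(1 + np.outer(a, r)).mean(1) # finite everywhere on this grid
```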

Plotting a parametric mean in Python
