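I recently wrote a function for an ordered logit model, but it takes a long time to run on large data sets.
So I want to rewrite the code and replace the if statements with the numpy.where function.
My new code has some problems, and I don't know how to fix them.
If you know how, please help me. Thank you very much!
Here is my original function.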
import numpy as np
from scipy.stats import logistic
def func(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            for i in xrange(1, len(thresholds)):
                if row[0] == i:
                    diff_prob = logistic.cdf(thresholds[i] - row[1]) - logistic.cdf(thresholds[i - 1] - row[1])
                    if diff_prob <= 10 ** -5:
                        ll += np.log(10 ** -5)
                    else:
                        ll += np.log(diff_prob)
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print func(y, X, thresholds)
Here is the new but imperfect code.
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
              np.where(y == len(thresholds), logistic.logcdf(X - thresholds[-1]),
                       np.log(logistic.cdf(thresholds[1] - X) - logistic.cdf(thresholds[0] - X))))
print ll.sum()
The problem is that I don't know how to rewrite the inner loop (for i in xrange(1, len(thresholds)):).
Answer 0 (score: 4)
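I think asking how to implement this using only np.where is a bit of an X/Y problem.
So instead I will try to explain how I would approach optimizing this function.
My first instinct is to get rid of the inner for loop, which is the pain point anyway: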
import numpy as np
from scipy.stats import logistic
def func1(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
                        logistic.cdf(thresholds[row[0] - 1] - row[1])
            diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
            ll += np.log(diff_prob)
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))
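I have just replaced i with row[0], without changing the semantics of the loop. So that is one fewer for loop.
Next, I want the statements in the different branches of the if-else to have the same form. To that end: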
import numpy as np
from scipy.stats import logistic
def func2(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            ll += np.log(
                np.maximum(
                    10 ** -5,
                    logistic.cdf(thresholds[row[0]] - row[1]) -
                    logistic.cdf(thresholds[row[0] - 1] - row[1])
                )
            )
    return ll
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))
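Now the statement in every branch has the form ll += expr.
At this point, the optimization can take a few different paths. You could try to optimize away the loop by writing it as a comprehension, but I suspect that would not give you much of a speed-up.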
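For reference, a minimal sketch of what the comprehension route could look like. The helper names term and func2_comprehension are hypothetical, not part of the original code. It still evaluates one row at a time in Python, which is why a large speed-up seems unlikely.

```python
import numpy as np
from scipy.stats import logistic

def term(y_i, x_i, thresholds):
    """Log-likelihood contribution of a single observation (hypothetical helper)."""
    if y_i == 0:
        return logistic.logcdf(thresholds[0] - x_i)
    if y_i == len(thresholds):
        return logistic.logcdf(x_i - thresholds[-1])
    diff = (logistic.cdf(thresholds[y_i] - x_i)
            - logistic.cdf(thresholds[y_i - 1] - x_i))
    # Same floor of 1e-5 on the probability as in the looped versions
    return np.log(max(1e-5, diff))

def func2_comprehension(y, X, thresholds):
    # Generator expression: one Python-level call per row, so still slow
    return sum(term(y_i, x_i, thresholds) for y_i, x_i in zip(y, X))

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
result = func2_comprehension(y, X, thresholds)
print(result)  # roughly -3.47
```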
An alternative path is to pull the if conditions out of the loop. That is what you intended to do with np.where as well:
import numpy as np
from scipy.stats import logistic
def func3(y, X, thresholds):
    # Boolean masks for the lowest, highest, and intermediate categories
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)
    # Each branch of the if-else becomes one vectorized expression
    ll_1 = logistic.logcdf(thresholds[0] - X[ y_0 ])
    ll_2 = logistic.logcdf(X[ y_end ] - thresholds[-1])
    ll_3 = np.log(
        np.maximum(
            10 ** -5,
            logistic.cdf(thresholds[y[ y_rest ]] - X[ y_rest ]) -
            logistic.cdf(thresholds[ y[y_rest] - 1 ] - X[ y_rest])
        )
    )
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)
y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))
Note that I have turned X into an np.array so that I can use fancy indexing on it.
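To make the indexing concrete, here is a small illustration with made-up values (the arrays below are assumptions chosen for the demo, not data from the question). A boolean mask selects rows, and an integer array looks up one threshold per selected row. A plain Python list does not support this, which is why the conversion is needed.

```python
import numpy as np

y = np.array([0, 1, 2, 1])
X = np.array([2.0, 2.5, 3.0, 3.5])
thresholds = np.array([2.0, 3.0])

# Boolean mask: True for rows in an intermediate category
mask = (y != 0) & (y != len(thresholds))
selected_X = X[mask]                   # rows 1 and 3 of X
upper = thresholds[y[mask]]            # upper threshold for each selected row
lower = thresholds[y[mask] - 1]        # lower threshold for each selected row
print(selected_X, upper, lower)        # [2.5 3.5] [3. 3.] [2. 2.]

# A plain list cannot be indexed with a mask array
list_failed = False
try:
    [2.0, 2.5, 3.0, 3.5][mask]
except TypeError:
    list_failed = True
print(list_failed)
```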
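At this point, I would wager that it is fast enough for my purposes. However, you can stop earlier or go beyond this point, depending on your requirements.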
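If you still want the nested np.where form from the question, one possible sketch follows. The name func_where is hypothetical. Because np.where evaluates every branch for every row, the threshold indices are clipped so the lookups stay in range for the first and last categories. Those values are then discarded by the outer conditions.

```python
import numpy as np
from scipy.stats import logistic

def func_where(y, X, thresholds):
    """Nested np.where version that works for any number of thresholds (a sketch)."""
    y = np.asarray(y)
    X = np.asarray(X, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    n = len(thresholds)
    # Clip so that y == 0 and y == n still index valid positions;
    # the results for those rows are thrown away by np.where below.
    upper = thresholds[np.clip(y, 0, n - 1)]
    lower = thresholds[np.clip(y - 1, 0, n - 1)]
    middle = np.log(np.maximum(1e-5, logistic.cdf(upper - X) - logistic.cdf(lower - X)))
    ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
                  np.where(y == n, logistic.logcdf(X - thresholds[-1]), middle))
    return ll.sum()

result = func_where(np.array([0, 1, 2]), [2, 2, 2], np.array([2, 3]))
print(result)  # roughly -3.47
```

Compared with the masked version, this computes all three expressions for every row, so it does some redundant work in exchange for a single expression.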
On my computer, I get the following results:
y = np.random.random_integers(0, 10, size=(10000,))
X = np.random.random_integers(0, 10, size=(10000,))
thresholds = np.cumsum(np.random.rand(10))
%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop
%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop
%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop
%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop
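A speed-up only matters if both versions return the same value, so a quick consistency check is worthwhile. The sketch below is self-contained: loop_ll and vectorized_ll are hypothetical names that restate the looped and masked logic above, and the random data is an assumption that mirrors the benchmark setup.

```python
import numpy as np
from scipy.stats import logistic

def loop_ll(y, X, thresholds):
    """Row-by-row reference with the same logic as the looped versions."""
    ll = 0.0
    for y_i, x_i in zip(y, X):
        if y_i == 0:
            ll += logistic.logcdf(thresholds[0] - x_i)
        elif y_i == len(thresholds):
            ll += logistic.logcdf(x_i - thresholds[-1])
        else:
            diff = (logistic.cdf(thresholds[y_i] - x_i)
                    - logistic.cdf(thresholds[y_i - 1] - x_i))
            ll += np.log(max(1e-5, diff))
    return ll

def vectorized_ll(y, X, thresholds):
    """Masked, vectorized computation of the same quantity."""
    first = y == 0
    last = y == len(thresholds)
    mid = ~(first | last)
    diff = (logistic.cdf(thresholds[y[mid]] - X[mid])
            - logistic.cdf(thresholds[y[mid] - 1] - X[mid]))
    return (logistic.logcdf(thresholds[0] - X[first]).sum()
            + logistic.logcdf(X[last] - thresholds[-1]).sum()
            + np.log(np.maximum(1e-5, diff)).sum())

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
y = rng.integers(0, 11, size=1000)
X = rng.integers(0, 11, size=1000).astype(float)
thresholds = np.cumsum(rng.random(10))

slow = loop_ll(y, X, thresholds)
fast = vectorized_ll(y, X, thresholds)
print(np.isclose(slow, fast))
```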