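"Motelling" is a way of smoothing a signal response.
For example: given a time-varying signal S_t that takes integer values 1-5, and a response function F_t({S_0..t}) that assigns one of [-1, 0, +1] to each signal, a standard motelling response function would return (as implemented in the loop further down):
    +1 if S_t = 5, or if S_t >= 4 and F_(t-1) = +1
    -1 if S_t = 1, or if S_t <= 2 and F_(t-1) = -1
    0 otherwise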
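If I have a DataFrame holding the signal {S} over time, is there a vectorized way to apply this motelling function?
For example, if the DataFrame has df['S'].values = [1, 2, 2, 2, 3, 5, 3, 4, 1]
then is there a vectorized approach that would produce:
df['F'].values = [-1, -1, -1, -1, 0, 1, 0, 0, -1]
Or, if there is no vectorized solution, is there something significantly faster than the DataFrame.itertuples() approach I am using now?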
import numpy as np
import pandas as pd

# randint excludes the upper bound, so (1, 6) draws the values 1..5
df = pd.DataFrame(np.random.randint(1, 6, 100000), columns=['S'])
# First set response for time t
df['F'] = np.where(df['S'] == 5, 1, np.where(df['S'] == 1, -1, 0))
# Now loop to apply motelling
previousF = 0
for row in df.itertuples():
    df.at[row.Index, 'F'] = np.where((row.S >= 4) & (previousF == 1), 1,
                                     np.where((row.S <= 2) & (previousF == -1), -1, row.F))
    previousF = row.F
On a complex DataFrame, the loop takes on the order of a minute per million rows!
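For reference, the recurrence described in the question can be sketched as a plain Python loop. This is only a baseline for checking faster versions against, not a performance suggestion; the name motel_reference and the choice of a neutral response (0) before the first sample are assumptions made for illustration.

```python
def motel_reference(signal):
    """Plain-Python baseline for the motelling recurrence (hypothetical helper)."""
    out = []
    prev = 0  # assumption: the response before the first sample is neutral
    for s in signal:
        if s == 5 or (s >= 4 and prev == 1):
            cur = 1
        elif s == 1 or (s <= 2 and prev == -1):
            cur = -1
        else:
            cur = 0
        out.append(cur)
        prev = cur  # carry the smoothed response forward, not the raw one
    return out


print(motel_reference([1, 2, 2, 2, 3, 5, 3, 4, 1]))
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```

The key detail is that prev holds the response just computed for the previous step, so a run of 2s after a 1 (or of 4s after a 5) keeps the earlier response alive.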
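Answer 0 (score: 1)
You can try regular expressions.
The patterns we are looking for are:
(1) a 1 followed by any run of 1s or 2s. (We choose this rule because any 2 after a 1 can be treated as a 1, and so it affects the result for the next row.)
(2) a 5 followed by any run of 4s or 5s. (Likewise, any 4 after a 5 can be treated as a 5.)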
(1) yields a run of consecutive -1s and (2) yields a run of consecutive 1s. Everything left unmatched is 0.
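With these rules, the remaining work is the substitution. In particular, we pass re.sub a replacement function that turns each match into a string of the same length (see Reference):
lambda m: "x"*len(m.group(0))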
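To make the substitution step concrete, here is a small sketch that prints the intermediate strings for the sample signal. The helper name fill is hypothetical; only re.sub with a callable replacement is standard library behaviour.

```python
import re


def fill(ch):
    # Return a replacement callable that emits `ch` once per matched character,
    # so the output string keeps the same length as the input.
    return lambda m: ch * len(m.group(0))


text = "122235341"                               # the sample signal as one digit per step
step1 = re.sub("5[45]*", fill("x"), text)        # mark runs that start with 5
step2 = re.sub("1[12]*", fill("y"), step1)       # mark runs that start with 1

print(step1)  # 12223x341
print(step2)  # yyyy3x34y
```

Because every match is replaced by a string of equal length, position i in the final string still corresponds to time step i in the signal.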
import re

s = [1, 2, 2, 2, 3, 5, 3, 4, 1]
str_s = "".join(str(i) for i in s)
s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)), str_s)
s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)), s1)
l = list(s2)
l2 = [v if v in ["x", "y"] else 0 for v in l]
l3 = [1 if v == 'x' else v for v in l2]
l4 = [-1 if v == 'y' else v for v in l3]
print(l4)
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]

Larger dataset

import numpy as np

def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)), str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)), s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4

s = np.random.randint(1, 6, 100000)
%timeit tai(s)
104 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
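As a possible tidy-up of the same idea, the three list comprehensions could be folded into a single dictionary lookup. This is a sketch, not a measured improvement; the names CODES and tai_compact are made up here, and like the original it assumes every signal value is a single digit.

```python
import re

CODES = {"x": 1, "y": -1}  # any other character maps to 0


def tai_compact(s):
    """Regex-based motelling with one final lookup (hypothetical variant)."""
    text = "".join(str(i) for i in s)  # assumes single-digit values
    text = re.sub("5[45]*", lambda m: "x" * len(m.group(0)), text)
    text = re.sub("1[12]*", lambda m: "y" * len(m.group(0)), text)
    return [CODES.get(ch, 0) for ch in text]


print(tai_compact([1, 2, 2, 2, 3, 5, 3, 4, 1]))
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```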
Reference
Replace substrings in python with the length of each substring
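Answer 1 (score: 1)
As you may have noticed, this does not vectorize well because consecutive elements of F[t] depend on one another. In cases like this I prefer numba. Your function is simple, it works on a numpy array (a Series is just an array under the hood), and it is not easy to vectorize, so numba is a good fit.
Imports and function: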
import numpy as np
import pandas as pd

def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        # at t == 0, F[t-1] is F[-1], the last element, which is still zero
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
        # no else required since it's already set to zero
    return F
Here we can simply jit-compile the function:
import numba
jit_motel = numba.jit(nopython=True)(motel)
Make sure the plain and jit versions return the expected values:
S = pd.Series([1, 2, 2, 2, 3, 5, 3, 4, 1])
print("motel(S) = ", motel(S))
print("jit_motel(S)", jit_motel(S.values))
Result:
motel(S) = [-1 -1 -1 -1 0 1 0 0 -1]
jit_motel(S) [-1 -1 -1 -1 0 1 0 0 -1]
For timing, let's measure:
N = 10**4
# note: randint excludes the upper bound, so this draws values 1..4
S = pd.Series(np.random.randint(1, 5, N))
%timeit jit_motel(S.values)
%timeit motel(S.values)
Result:
82.7 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.75 ms ± 77.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For your million data points (I did not time the plain function because I didn't want to wait =) ):
N = 10**6
S = pd.Series( np.random.randint(1, 5, N) )
%timeit motel(S.values)
Result:
768 ms ± 7.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
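Boom! A million items in under a second. This approach is simple, readable, and fast. The only downside is the Numba dependency, but it is included in Anaconda and is easy to install with conda (maybe with pip too, I'm not sure).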
Answer 2 (score: 1)
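To sum up the other answers, I should first note that the DataFrame.itertuples() loop does not behave as intended: each tuple holds the values from before the df.at write, so previousF = row.F carries the unsmoothed response forward, and the sample in the OP does not always produce correct results on large samples.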
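Thanks to the other answers, I realized that a mechanical application of the motelling logic not only produces the correct results but also runs very quickly when we use the ffill (forward fill) method: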
def dfmotel(df):
    # We'll copy results into column F as we build them
    df['F'] = np.nan
    # This algo is destructive, so we operate on a copy of the signal
    # (float, so that NaN can be stored in it)
    df['temp'] = df['S'].astype(float)
    # Fill forward the negative signal
    df.loc[df['temp'] == 2, 'temp'] = np.nan
    df['temp'] = df['temp'].ffill()
    df.loc[df['temp'] == 1, 'F'] = -1
    # Restart from the original signal so values filled in above
    # (e.g. a 5 copied over a 2) cannot leak into the positive pass
    df['temp'] = df['S'].astype(float)
    # Fill forward the positive signal
    df.loc[df['temp'] == 4, 'temp'] = np.nan
    df['temp'] = df['temp'].ffill()
    df.loc[df['temp'] == 5, 'F'] = 1
    # All other signals are zero
    df['F'] = df['F'].fillna(0)
For all timing tests, we'll use the same input:
# note: randint excludes the upper bound, so this draws values 1..4
df = pd.DataFrame(np.random.randint(1,5,1000000), columns=['S'])
For the DataFrame-based function above, we get:
%timeit dfmotel(df.copy())
123 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is perfectly acceptable performance.
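One way to gain confidence in a forward-fill formulation is to compare it with a plain loop on input that contains all five values. The sketch below is an assumption-laden illustration rather than the answer's own code: motel_ffill and motel_loop are hypothetical names, each fill is built from the original signal with Series.where, and the random input is drawn with an exclusive upper bound of 6 so that 5s actually occur.

```python
import numpy as np
import pandas as pd


def motel_ffill(s):
    """Forward-fill sketch; `s` is a pandas Series of integers 1-5."""
    last_non2 = s.where(s != 2).ffill()  # each 2 inherits the last non-2 value
    last_non4 = s.where(s != 4).ffill()  # each 4 inherits the last non-4 value
    f = pd.Series(0, index=s.index)
    f[last_non2 == 1] = -1               # 1s and the 2s chained to them
    f[last_non4 == 5] = 1                # 5s and the 4s chained to them
    return f


def motel_loop(values):
    """Plain-Python baseline used only for the comparison."""
    out, prev = [], 0
    for v in values:
        if v == 5 or (v >= 4 and prev == 1):
            prev = 1
        elif v == 1 or (v <= 2 and prev == -1):
            prev = -1
        else:
            prev = 0
        out.append(prev)
    return out


rng = np.random.default_rng(0)
s = pd.Series(rng.integers(1, 6, 10_000))  # values 1..5 inclusive
assert motel_ffill(s).tolist() == motel_loop(s.tolist())
```

The two masks cannot overlap: the first only ever selects positions where the signal is 1 or 2, the second only positions where it is 4 or 5.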
tai was the first to present this very clever solution using RegEx (which inspired my function above), but it cannot match the speed of staying in numeric space:
import re
def tai(s):
    str_s = "".join(str(i) for i in s)
    s1 = re.sub("5[45]*", lambda m: "x"*len(m.group(0)), str_s)
    s2 = re.sub("1[12]*", lambda m: "y"*len(m.group(0)), s1)
    l = list(s2)
    l2 = [v if v in ["x", "y"] else 0 for v in l]
    l3 = [1 if v == 'x' else v for v in l2]
    l4 = [-1 if v == 'y' else v for v in l3]
    return l4
%timeit tai(df['S'].values)
899 ms ± 9.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But nothing beats compiled code. Thanks to evamicur for this solution using the convenient numba in-line compiler:
import numba
def motel(S):
    F = np.zeros_like(S)
    for t in range(S.shape[0]):
        if (S[t] == 1) or (S[t] == 2 and F[t-1] == -1):
            F[t] = -1
        elif (S[t] == 5) or (S[t] == 4 and F[t-1] == 1):
            F[t] = 1
    return F
jit_motel = numba.jit(nopython=True)(motel)
%timeit jit_motel(df['S'].values)
9.06 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
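If an extra dependency is not an option, the forward-fill idea can also be written in NumPy alone. The sketch below is an illustration under stated assumptions, not one of the benchmarked solutions: motel_numpy is a hypothetical name, and the forward fill is emulated with np.maximum.accumulate over the indices of the elements that are kept. No timing is claimed for it.

```python
import numpy as np


def motel_numpy(S):
    """Vectorized motelling sketch using only NumPy (hypothetical helper)."""
    S = np.asarray(S)
    pos = np.arange(S.size)
    # Index of the most recent element that is not 2 (falls back to 0 at the start)
    last_non2 = np.maximum.accumulate(np.where(S != 2, pos, 0))
    # Index of the most recent element that is not 4 (falls back to 0 at the start)
    last_non4 = np.maximum.accumulate(np.where(S != 4, pos, 0))
    F = np.zeros(S.size, dtype=np.int64)
    F[S[last_non2] == 1] = -1  # 1s and the 2s chained to them
    F[S[last_non4] == 5] = 1   # 5s and the 4s chained to them
    return F


print(motel_numpy([1, 2, 2, 2, 3, 5, 3, 4, 1]).tolist())
# [-1, -1, -1, -1, 0, 1, 0, 0, -1]
```

The fallback index 0 is harmless: a leading 2 looks up S[0], which is 2 rather than 1, and a leading 4 looks up S[0], which is 4 rather than 5, so both correctly stay at zero.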