Question

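A single machine can deliver data from thousands of sensors. At time one the machine uncoils a metal strip, at the next time step the strip is heated, and at the third it is cooled. Using the timestamp, the measured speed, and triggers (e.g. oven input/output), a strip-length (meter) variable can be generated in the ETL step.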

+----------------+----------+-----------+---------+-----+
|time            |input_oven|output_oven|temp_oven|speed|
+----------------+----------+-----------+---------+-----+
|2017-01-01-01-20|0         |0          |450      |3    |
|2017-01-01-01-21|0         |0          |450      |3    |
|2017-01-01-01-22|1         |0          |450      |3    |
|2017-01-01-01-23|0         |0          |450      |4    |
|2017-01-01-01-24|0         |0          |451      |4    |
|2017-01-01-01-25|0         |1          |450      |4    |
|2017-01-01-01-26|0         |0          |450      |3    |
+----------------+----------+-----------+---------+-----+

As you can see, the speed can vary. I tried the following code, but it is too inaccurate, partly because the machine can, for example, stop.

from scipy import integrate
s = lambda s: col_speed*col_time
integrate.quad(s, time_1, time_2)

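The integration therefore has to be performed over the speed variable so that a new meter (length) variable can be generated. One file contains 30,000 entries for 5,000 sensors.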
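One way to picture the integration described here is a running trapezoidal sum of speed over time. The sketch below is plain Python rather than PySpark and is only an illustration: the function name `cumulative_length` is hypothetical, time is taken as the minute part of the timestamps in the table above, and speed is assumed to be in m/min.

```python
def cumulative_length(times, speeds):
    """Running trapezoidal integral of speed over time.

    times  : sample times in minutes (assumed sorted ascending)
    speeds : measured speed at each sample (assumed m/min)
    Returns the strip length in metres that has passed by each sample.
    """
    if not times:
        return []
    positions = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Mean of the two neighbouring speed samples (trapezoidal rule).
        mean_speed = (speeds[i] + speeds[i - 1]) / 2.0
        positions.append(positions[-1] + mean_speed * dt)
    return positions


# Sample rows from the table above (minute part of the timestamp only).
times = [20, 21, 22, 23, 24, 25, 26]
speeds = [3, 3, 3, 4, 4, 4, 3]
positions = cumulative_length(times, speeds)
print(positions)  # [0.0, 3.0, 6.0, 9.5, 13.5, 17.5, 21.0]
```

A machine stop shows up as samples with speed 0, which add nothing to the running length, so the position simply stays flat during the stop.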

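The result must be a table that runs parallel to all the sensor data, so that I can see which oven temperature and which cooling rate a given meter of the metal strip has experienced.

Thank you very much for your help in advance.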

Edit

To provide further insight, I have added the following image.

Time series of several sensor signals of one production line. The green line represents the current time. The yellow line represents the same length position at different timestamps.

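The goal of the ETL job should be to align all sensor signals with respect to length position. I therefore thought of using the following formula: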

length = speed * time
time = time_delta(output_oven-input_oven)
speed = avg(speed)

For the given sample data, the equation should be solved like this, and the same should be done for the complete DataFrame:

length = avg(speed) * time_delta(output_oven-input_oven)
length = 4 m/min * (2017-01-01-01-25 - 2017-01-01-01-22)
length = 4 m/min * 3 min = 12 m
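The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `oven_length` is hypothetical, and the averaging window (rows after the entry trigger, up to and including the exit trigger) is an assumption chosen because it yields the 4 m/min used above. Including the entry row itself would give an average of 3.75 m/min instead.

```python
def oven_length(rows):
    """Length between the oven entry and exit triggers as avg(speed) * time delta.

    rows: list of dicts with keys time (min), input_oven, output_oven, speed (m/min).
    Raises StopIteration if one of the triggers is missing.
    """
    t_in = next(r["time"] for r in rows if r["input_oven"] == 1)
    t_out = next(r["time"] for r in rows
                 if r["output_oven"] == 1 and r["time"] > t_in)
    # Assumed window: strictly after entry, up to and including exit.
    inside = [r["speed"] for r in rows if t_in < r["time"] <= t_out]
    avg_speed = sum(inside) / len(inside)
    return avg_speed * (t_out - t_in)


# Sample data from the question (minute part of the timestamp only).
rows = [
    {"time": 20, "input_oven": 0, "output_oven": 0, "speed": 3},
    {"time": 21, "input_oven": 0, "output_oven": 0, "speed": 3},
    {"time": 22, "input_oven": 1, "output_oven": 0, "speed": 3},
    {"time": 23, "input_oven": 0, "output_oven": 0, "speed": 4},
    {"time": 24, "input_oven": 0, "output_oven": 0, "speed": 4},
    {"time": 25, "input_oven": 0, "output_oven": 1, "speed": 4},
    {"time": 26, "input_oven": 0, "output_oven": 0, "speed": 3},
]
length = oven_length(rows)
print(length)  # 12.0
```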

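Now I know which part of my metal strip has passed through the oven. Suppose my metal strip is 12 m long. I now want to lag all the other sensor signals according to length.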
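Lagging by length rather than by a fixed time can be sketched as a lookup: for the material that passes the first sensor at a given moment, find the time at which the running length has grown by the distance to the next sensor. The plain-Python example below is an illustration under several assumptions: `time_at_position` and `OVEN_LENGTH_M` are hypothetical names, the 12 m distance is taken from the example above, and `positions` is the trapezoidal running integral of the sample speeds, worked out by hand.

```python
from bisect import bisect_left


def time_at_position(times, positions, target):
    """Interpolated time at which the cumulative length reaches `target`.

    positions must be non-decreasing; returns None outside the recorded range.
    """
    if not positions or target < positions[0] or target > positions[-1]:
        return None
    i = bisect_left(positions, target)
    if i == 0:
        return float(times[0])
    # bisect_left guarantees positions[i - 1] < target <= positions[i].
    p0, p1 = positions[i - 1], positions[i]
    frac = (target - p0) / (p1 - p0)
    return times[i - 1] + frac * (times[i] - times[i - 1])


times = [20, 21, 22, 23, 24, 25, 26]
positions = [0.0, 3.0, 6.0, 9.5, 13.5, 17.5, 21.0]  # metres, integrated speed
OVEN_LENGTH_M = 12.0  # hypothetical distance between the two sensors

# For the material at the first sensor at each sample time, the time at
# which it reaches the second sensor (None if not yet recorded).
exit_times = [time_at_position(times, positions, p + OVEN_LENGTH_M)
              for p in positions]
print(exit_times[:2])  # [23.625, 24.375]
```

Because the lookup runs on position instead of time, the lag stretches automatically when the line slows down or stops.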

Answer 1

Here is my attempt. Is this close to what you want?

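from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

# Sample data from the question; 'time' is reduced to the minute value.
Columns = Row('time', 'input_oven', 'output_oven', 'temp_oven', 'speed')
x = [Columns(20, 0, 0, 450, 3),
     Columns(21, 0, 0, 450, 3),
     Columns(22, 1, 0, 450, 3),
     Columns(23, 0, 0, 450, 4),
     Columns(24, 0, 0, 451, 4),
     Columns(25, 0, 1, 450, 4),
     Columns(26, 0, 0, 450, 3)]

# Constant id so that all rows fall into a single window partition.
df = spark.createDataFrame(x).withColumn('id', f.lit(1))
df.printSchema()

# Combine both triggers into one event flag.
df1 = df.withColumn('oven', df['input_oven'] + df['output_oven'])

# Running count of trigger events, ordered by time.
w = Window.partitionBy(df['id']).orderBy(df['time'])
cum_oven = f.sum(df1['oven']).over(w)
df2 = df1.select(df1['time'], df1['speed'], df1['output_oven'], cum_oven.alias('cum_oven'))
# Subtract output_oven so the exit row still belongs to the in-oven segment.
df3 = df2.withColumn('cum_oven', df2['cum_oven'] - df2['output_oven']).drop(df2['output_oven'])

# Within each segment: elapsed time since the segment start times the current speed.
ws = Window.partitionBy(df3['cum_oven']).orderBy(df3['time'])
metal_length = (f.max(df3['time']).over(ws) - f.min(df3['time']).over(ws)) * df3['speed']

df4 = df3.select(df3['time'], df3['cum_oven'], metal_length.alias('metal_length'))

# Join the segment id and length back onto the original sensor rows.
fdf = df.join(df4, ['time'])
fdf.drop('id').sort('time').show()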

+----+----------+-----------+---------+-----+--------+------------+
|time|input_oven|output_oven|temp_oven|speed|cum_oven|metal_length|
+----+----------+-----------+---------+-----+--------+------------+
|  20|         0|          0|      450|    3|       0|           0|
|  21|         0|          0|      450|    3|       0|           3|
|  22|         1|          0|      450|    3|       1|           0|
|  23|         0|          0|      450|    4|       1|           4|
|  24|         0|          0|      451|    4|       1|           8|
|  25|         0|          1|      450|    4|       1|          12|
|  26|         0|          0|      450|    3|       2|           0|
+----+----------+-----------+---------+-----+--------+------------+
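The `cum_oven` column in this output acts as a segment id: it is a running count of trigger events, with the exit trigger subtracted so that the exit row still counts as part of the in-oven segment. The following plain-Python sketch reproduces that column from the two trigger columns of the sample data. It only illustrates the logic and is not a replacement for the window functions.

```python
from itertools import accumulate

# Trigger columns from the sample data, ordered by time (20 .. 26).
input_oven = [0, 0, 1, 0, 0, 0, 0]
output_oven = [0, 0, 0, 0, 0, 1, 0]

# One event flag per row, then a running sum over time.
events = [i + o for i, o in zip(input_oven, output_oven)]
running = list(accumulate(events))

# Subtract the exit flag so the exit row stays in the in-oven segment.
cum_oven = [c - o for c, o in zip(running, output_oven)]
print(cum_oven)  # [0, 0, 1, 1, 1, 1, 2]
```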

To finish, is the final aggregation then just a groupBy with max and sum?
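As a rough illustration of that last step, the sketch below takes the `cum_oven` and `metal_length` columns from the output table and keeps the maximum length per segment, in plain Python. In PySpark the equivalent would presumably be something like `fdf.groupBy('cum_oven').agg(f.max('metal_length'))`. Whether max alone is enough or a sum is also needed depends on what the final table should contain, which the question leaves open.

```python
from collections import defaultdict

# Columns copied from the output table above.
cum_oven = [0, 0, 1, 1, 1, 1, 2]
metal_length = [0, 3, 0, 4, 8, 12, 0]

# Group by segment id and keep the largest length seen in each segment.
segment_length = defaultdict(int)
for seg, length in zip(cum_oven, metal_length):
    segment_length[seg] = max(segment_length[seg], length)

print(dict(segment_length))  # {0: 3, 1: 12, 2: 0}
```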

How to implement an efficient time-to-length transformation with speed in PySpark?
