Get the mean and standard deviation of value changes every 10 seconds in pandas

Asked: 2018-12-04 20:53:52

Tags: python python-3.x pandas dataframe time-series

I have a dataset of readings from different sensors. The goal is to get, for every 10 seconds across all sensors, the mean change in the readings and the standard deviation of those changes (i.e. the final output is just a timestamp, the mean change, and the standard deviation of the changes).

Since the data is so large, I initially used Spark to process the 200 million rows:

The raw data looks like this:

>>> df.show(10, False)

+--------------+-----------------------+-----------+-------+
|integerdate   |        event_timestamp|  sensor_id|reading|
+--------------+-----------------------+-----------+-------+
|20180703      |2018-07-03 10:32:50.473|      Front|  54.82|
|20180703      |2018-07-03 15:59:50.616|      Front|  54.54|
|20180703      |2018-07-03 14:49:55.718|      Front|  54.64|
|20180703      |2018-07-03 09:30:00.003|       Bore|  55.60|
|20180703      |2018-07-03 15:08:16.099|       Bore|  54.66|
|20180703      |2018-07-03 09:30:54.837|      Atten|  57.08|
|20180703      |2018-07-03 09:40:24.333|      Atten|  57.08|
|20180703      |2018-07-03 10:06:01.027|      Atten|  56.69|
|20180703      |2018-07-03 10:06:28.787|      Atten|  56.70|
|20180703      |2018-07-03 10:14:32.675|      Atten|  56.64|
+--------------+-----------------------+-----------+-------+

However, since I am only interested in changes in the readings, I used Spark's window and analytic function lag to look up each sensor's previous reading within each day, keeping each sensor's first reading of the day and only the rows where the reading changed:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.window import Window

>>> df1 = df\
  .withColumn('previous_reading',
              F.lag(df.reading, 1)
              .over(Window.partitionBy('integerdate', 'sensor_id')
                    .orderBy(df.event_timestamp)))\
  .filter((F.col('previous_reading').isNull()) | (F.col('reading') != F.col('previous_reading')))\
  .withColumn('reading_change', F.bround(df.reading - F.col('previous_reading'), 2))\
  .withColumn('previous_timestamp',
              F.lag(df.event_timestamp, 1)
              .over(Window
                    .partitionBy('integerdate', 'sensor_id')
                    .orderBy(df.event_timestamp)))\
  .withColumn('seconds_elapsed',
              time_elapsed_udf(F.struct(df.event_timestamp,
                                        F.col('previous_timestamp'))))
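
Here, time_elapsed_udf is a UDF of mine that I have not shown; a minimal reconstruction (just a sketch, the actual implementation may differ) that returns the whole seconds between the two timestamps would be:

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Sketch of the UDF used above (not shown in the original question):
# takes a struct of (event_timestamp, previous_timestamp) and returns
# the whole seconds between them, or None for a partition's first row.
@F.udf(returnType=LongType())
def time_elapsed_udf(ts_pair):
    current, previous = ts_pair
    if previous is None:
        return None
    return int((current - previous).total_seconds())

(I realize the UDF could probably be replaced with the built-in expression F.col('event_timestamp').cast('long') - F.col('previous_timestamp').cast('long'), which would avoid the Python serialization overhead.)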

>>> df1.show(10, False)
+-----------+-----------------------+-----------+-------+----------------+--------------+-----------------------+---------------+
|integerdate|event_timestamp        |sensor_id  |reading|previous_reading|reading_change|previous_timestamp     |seconds_elapsed|
+-----------+-----------------------+-----------+-------+----------------+--------------+-----------------------+---------------+
|20180703   |2018-07-03 09:32:00.972|Back       |  0.365|null            |null          |null                   |null           |
|20180703   |2018-07-03 09:36:04.096|Anter      |  0.210|null            |null          |null                   |null           |
|20180703   |2018-07-03 11:59:17.118|Anter      |  0.250|0.21            |0.04          |2018-07-03 09:36:04.096|8593           |
|20180703   |2018-07-03 12:47:40.309|Alloc      |  47.99|null            |null          |null                   |null           |
|20180703   |2018-07-03 08:00:13.931|Bore       |  2.730|null            |null          |null                   |null           |
|20180703   |2018-07-03 09:30:00.003|Bore       |  2.750|2.73            |0.02          |2018-07-03 08:00:13.931|5386           |
|20180703   |2018-07-03 09:30:00.003|Bore       |  2.710|2.75            |-0.04         |2018-07-03 09:30:00.003|0              |
|20180703   |2018-07-03 09:30:00.697|Bore       |  2.780|2.71            |0.07          |2018-07-03 09:30:00.003|0              |
|20180703   |2018-07-03 09:32:47.269|Bore       |  2.730|2.78            |-0.05         |2018-07-03 09:30:00.697|166            |
|20180703   |2018-07-03 09:34:50.814|Bore       |  2.760|2.73            |0.03          |2018-07-03 09:32:47.269|123            |
+-----------+-----------------------+-----------+-------+----------------+--------------+-----------------------+---------------+

This reduces the total number of rows to about 20 million, which I can easily handle in pandas.

>>> pd_df = df1.toPandas()
>>> pd_df.head(10)

     integerdate  event_timestamp           sensor_id  reading    previous_reading    reading_change        previous_timestamp     seconds_elapsed
0    20180703     2018-07-03 09:32:00.972        Back    0.365                 NaN               NaN                       NaT                 NaN
1    20180703     2018-07-03 09:36:04.096       Anter    0.210                 NaN               NaN                       NaT                 NaN
2    20180703     2018-07-03 11:59:17.118       Anter    0.250                0.21              0.04   2018-07-03 09:36:04.096                8593 
3    20180703     2018-07-03 12:47:40.309       Alloc    47.99                 NaN               NaN                       NaT                 NaN  
4    20180703     2018-07-03 08:00:13.931        Bore    2.730                 NaN               NaN                       NaT                 NaN  
5    20180703     2018-07-03 09:30:00.003        Bore    2.750                2.73              0.02   2018-07-03 08:00:13.931                5386 
6    20180703     2018-07-03 09:30:00.003        Bore    2.710                2.75             -0.04   2018-07-03 09:30:00.003                   0 
7    20180703     2018-07-03 09:30:00.697        Bore    2.780                2.71              0.07   2018-07-03 09:30:00.003                   0 
8    20180703     2018-07-03 09:32:47.269        Bore    2.730                2.78             -0.05   2018-07-03 09:30:00.697                 166  
9    20180703     2018-07-03 09:34:50.814        Bore    2.760                2.73              0.03   2018-07-03 09:32:47.269                 123  

I think I only need event_timestamp, sensor_id and reading, and can use pandas resample() and apply() to get the reading changes every 10 seconds.

I decided to include previous_reading, previous_timestamp and seconds_elapsed in my Spark transformation to make sure the window and lag() were working correctly.

Now, the challenge is that for some sensors the reading changes every few microseconds or seconds, but for some sensors it does not change for many hours (see the toy example below).
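
To make that concrete, here is a toy example (synthetic data, not from my dataset) showing that resample('10S') leaves NaN for 10-second buckets in which nothing changed:

import pandas as pd

# Synthetic data: one change at 09:00:00 and the next only at 09:00:35.
toy = pd.Series(
    [0.04, 0.02],
    index=pd.to_datetime(['2018-07-03 09:00:00', '2018-07-03 09:00:35']),
    name='reading_change',
)

print(toy.resample('10S').mean())
# 2018-07-03 09:00:00    0.04
# 2018-07-03 09:00:10     NaN
# 2018-07-03 09:00:20     NaN
# 2018-07-03 09:00:30    0.02
# Freq: 10S, Name: reading_change, dtype: float64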

How can I use pandas resample() to get the reading changes in 10-second windows, and then compute for each window the mean change and the standard deviation of the changes across the sensor readings?

The final output should be in the following format:

event_timestamp     avg_change  std_dev
2018-07-03 09:00:10       0.05     0.02
2018-07-03 09:00:20       0.21     0.01
2018-07-03 09:00:30       0.58     0.12
2018-07-03 09:00:40       0.71     0.45
2018-07-03 09:00:50       1.14     0.78
2018-07-03 09:01:00       1.05     0.79
2018-07-03 09:01:10       5.05     0.24
2018-07-03 09:01:20       1.96     0.30
2018-07-03 09:01:30       0.51     0.01
2018-07-03 09:01:40       0.14     0.02
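
For reference, this is a minimal sketch of the resample()/agg() call I have in mind (assuming pd_df is the DataFrame above and that rows with a null reading_change are simply dropped); I am not sure it is the right way to handle the slow-moving sensors:

import pandas as pd

# Keep only the rows where a reading actually changed.
changes = pd_df.dropna(subset=['reading_change']).copy()
changes['event_timestamp'] = pd.to_datetime(changes['event_timestamp'])

# Pool all sensors together, bucket into 10-second windows, and compute
# the mean and standard deviation of the changes in each window.
result = (changes
          .set_index('event_timestamp')['reading_change']
          .resample('10S')
          .agg(['mean', 'std'])
          .rename(columns={'mean': 'avg_change', 'std': 'std_dev'}))

print(result.dropna(how='all').head())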

Let me know if you need me to provide any more information.

0 Answers