Reindexing (adding rows) at scale in PySpark

Date: 2018-05-30 15:41:01

Tags: python python-3.x apache-spark pyspark pyspark-sql

This question concerns processing a large dataset of observations in time. Later work requires a common timestep between observations, but in practice the raw data often misses timesteps. Given a timestep (say, 1 second), the goal of this question is to add rows corresponding to any missing timesteps over the entire range observed in the raw data, using PySpark.

I accomplish this by:

  1. generating a new sequence of time values in Python, using the minimum and maximum observed times and the assumed common timestep, and
  2. creating a new Spark DataFrame from this sequence and joining it onto the original data.

My question is whether there is a more efficient or natural way to solve this problem in PySpark (and, if not, whether there are any obvious improvements to my approach).

In particular, I am interested in whether this problem can be solved efficiently in PySpark, rather than by escaping into code written in Java as in this question.

I detail my solution below, along with the setup and creation of reproducible test data.

My solution

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession \
        .builder \
        .appName("Spark StackOverflow Test") \
        .getOrCreate()
    sc = spark.sparkContext

    df = spark.read\
        .options(header=True, inferSchema=True)\
        .csv('test_data.csv')

    # find min and max observed times after timesteps have been subsampled
    df.createOrReplaceTempView('test_view')
    tmin = spark.sql('select min(date) from test_view').collect()[0]['min(date)']
    tmax = spark.sql('select max(date) from test_view').collect()[0]['max(date)']

    # create full second-by-second index between tmin and tmax
    # (uses date_seq_generator and the imports defined in the setup section below)
    new_date_index = takewhile(lambda x: x <= tmax,
                               date_seq_generator(tmin, datetime.timedelta(seconds=1)))

    # create Spark dataframe for the new time index
    index_schema = StructType([StructField("date", StringType())])
    time_rdd = sc.parallelize([datetime.datetime.strftime(t, '%Y-%m-%d %H:%M:%S')
                               for t in new_date_index])
    df_dates = spark.createDataFrame(time_rdd.map(lambda s: s.split(',')),
                                     schema=index_schema)
    # cast new index type from string to timestamp
    df_dates = df_dates.withColumn("date", df_dates["date"].cast(TimestampType()))

    # join the spark dataframes to reindex, keeping every second in the new index
    reindexed = df_dates.join(df,
                              how='left',
                              on=df_dates.date == df.date).select([df_dates.date, df.foo])
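
As a quick sanity check (a sketch, not part of the original solution), the reindexed frame should contain one row per second between tmin and tmax, with a null foo wherever the raw data skipped a timestep:

    from pyspark.sql.functions import col

    reindexed.orderBy('date').show(10)
    print(reindexed.count())                              # one row per second from tmin to tmax
    print(reindexed.filter(col('foo').isNull()).count())  # number of timesteps that were missing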
    

Setup and creation of dummy reproducible data

Basic form:

                      date       foo
    0  2018-01-01 00:00:00  0.548814
    1  2018-01-01 00:00:01  0.715189
    2  2018-01-01 00:00:02  0.602763
    3  2018-01-01 00:00:03  0.544883
    4  2018-01-01 00:00:04  0.423655
    5  2018-01-01 00:00:05  0.645894
    6  2018-01-01 00:00:08  0.963663
    ...
    

Code:

    import datetime
    import pandas as pd
    import numpy as np
    from itertools import takewhile
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType
    from pyspark.sql.functions import col
    
    # set seed for data
    np.random.seed(0)
    
    def date_seq_generator(start, delta):
        """
        Generator function for time observations.
    
        :param start: datetime start time
        :param delta: timedelta between observations
        :returns: next time observation
        """
        current = start - delta
        while True:
            current += delta
            yield current
    
    def to_datetime(datestring):
        """Convert datestring to correctly-formatted datetime object."""
        return datetime.datetime.strptime(datestring, '%Y-%m-%d %H:%M:%S')
    
    # choose an arbitrary time period
    start_time = to_datetime('2018-01-01 00:00:00')
    end_time = to_datetime('2018-01-02 00:00:00')
    
    # create the full time index between the start and end times
    initial_times = list(takewhile(lambda x: x <= end_time,
                date_seq_generator(start_time, datetime.timedelta(seconds=1))))
    
    # create dummy dataframe in Pandas
    pd_df = pd.DataFrame({'date': initial_times,
                          'foo': np.random.uniform(size=len(initial_times))})
    
    # emulate missing time indices
    pd_df = pd_df.sample(frac=.7)
    
    # save test data
    pd_df.to_csv('test_data.csv', index=False)
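
A quick check of the generated file (again a sketch, assuming the code above has just been run in the same session):

    # read the test data back and confirm that roughly 30% of the seconds are missing
    check = pd.read_csv('test_data.csv', parse_dates=['date'])
    print(len(initial_times))  # 86401 seconds in the full (inclusive) range
    print(len(check))          # roughly 70% of that after the subsampling above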
    

1 answer:

Answer 0 (score: 0)

Completing the dates on Spark using Scala:

    import org.joda.time.LocalDate
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    // in spark-shell, spark.implicits._ (needed for $ and toDF) is already imported

    def dateComplete(dataFrameDate0: DataFrame, colName: String): DataFrame = {
      // iterate day by day; note that takeWhile(_ isBefore end) excludes the end date itself
      def dayIterator(start: LocalDate, end: LocalDate) =
        Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

      def dateSeries(date1: String, date2: String): Array[String] = {
        val fromDate = new LocalDate(date1)
        val toDate = new LocalDate(date2)
        dayIterator(fromDate, toDate).toArray.map(_.toString)
      }

      // observed min and max dates in the data
      val rangos = dataFrameDate0.agg(min($"invoicedate").as("minima_fecha"),
                                      max($"invoicedate").as("maxima_fecha"))

      // full day-by-day series as a one-column DataFrame; left-join the data onto it
      val serie_date = spark.sparkContext.parallelize(dateSeries(
          rangos.select("minima_fecha", "maxima_fecha").take(1)(0)(0).toString,
          rangos.select("minima_fecha", "maxima_fecha").take(1)(0)(1).toString)).toDF(colName)
      serie_date.join(dataFrameDate0, Seq(colName), "left")
    }

    val pivoteada = dateComplete(prod_group_day, "invoicedate")
      .groupBy("key_product")
      .pivot("invoicedate")
      .agg(sum("cantidad_prod").as("cantidad"))
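
For reference, the same idea can be written directly in PySpark. The sketch below is my own translation of the answer's logic, not part of the answer itself: it assumes a SparkSession named spark and a DataFrame prod_group_day whose invoicedate column is already a date type (both names taken from the answer), and unlike the Scala version it includes the maximum date in the generated series.

    # build the full day-by-day series between the observed min and max dates,
    # then left-join the original data onto it
    import datetime
    from pyspark.sql import functions as F

    def date_complete(df, col_name):
        lo, hi = df.agg(F.min(col_name), F.max(col_name)).first()
        days = [(lo + datetime.timedelta(days=i),) for i in range((hi - lo).days + 1)]
        full_index = spark.createDataFrame(days, [col_name])
        return full_index.join(df, [col_name], "left")

    completed = date_complete(prod_group_day, "invoicedate")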