Reindexing (adding rows) at scale in PySpark

Date: 2018-05-30 15:41:01

Tags: python python-3.x apache-spark pyspark pyspark-sql

This question concerns processing a large dataset of observations in time. Later work requires a common timestep between observations, but in practice the raw data often misses timesteps. Given a timestep (say, 1 second), the goal of this question is to add rows corresponding to any missing timesteps over the entire range observed in the raw data, using PySpark.

I accomplish this by:

  1. generating a new sequence of time values in Python, using the minimum and maximum observed times and the assumed common timestep, and
  2. creating a new Spark DataFrame from this sequence and joining it onto the original data.

My question is whether there is a more efficient or natural way to solve this problem in PySpark (and, if not, whether there are any obvious improvements to my approach).

In particular, I am interested in whether this problem can be solved efficiently in PySpark, rather than by escaping into code written in Java as in this question.

I detail my solution below, along with the setup and creation of reproducible test data.

My solution

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession \
        .builder \
        .appName("Spark StackOverflow Test") \
        .getOrCreate()
    sc = spark.sparkContext

    df = spark.read\
        .options(header=True, inferSchema=True)\
        .csv('test_data.csv')

    # find min and max observed times after timesteps have been subsampled
    df.createOrReplaceTempView('test_view')
    tmin = spark.sql('select min(date) from test_view').collect()[0]['min(date)']
    tmax = spark.sql('select max(date) from test_view').collect()[0]['max(date)']

    # create full second-by-second index between tmin and tmax
    # (uses date_seq_generator and the imports defined in the setup section below)
    new_date_index = takewhile(lambda x: x <= tmax,
                               date_seq_generator(tmin, datetime.timedelta(seconds=1)))

    # create Spark dataframe for the new time index
    index_schema = StructType([StructField("date", StringType())])
    time_rdd = sc.parallelize([datetime.datetime.strftime(t, '%Y-%m-%d %H:%M:%S')
                               for t in new_date_index])
    df_dates = spark.createDataFrame(time_rdd.map(lambda s: s.split(',')),
                                     schema=index_schema)
    # cast new index type from string to timestamp
    df_dates = df_dates.withColumn("date", df_dates["date"].cast(TimestampType()))

    # join the spark dataframes to reindex, keeping every second in the new index
    reindexed = df_dates.join(df,
                              how='left',
                              on=df_dates.date == df.date).select([df_dates.date, df.foo])
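
As a quick sanity check (a sketch, not part of the original solution), the reindexed frame should contain one row per second between tmin and tmax, with a null foo wherever the raw data skipped a timestep:

    from pyspark.sql.functions import col

    reindexed.orderBy('date').show(10)
    print(reindexed.count())                              # one row per second from tmin to tmax
    print(reindexed.filter(col('foo').isNull()).count())  # number of timesteps that were missing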
    

Setup and creation of dummy reproducible data

Basic form:

                      date       foo
    0  2018-01-01 00:00:00  0.548814
    1  2018-01-01 00:00:01  0.715189
    2  2018-01-01 00:00:02  0.602763
    3  2018-01-01 00:00:03  0.544883
    4  2018-01-01 00:00:04  0.423655
    5  2018-01-01 00:00:05  0.645894
    6  2018-01-01 00:00:08  0.963663
    ...
    

Code:

    import datetime
    import pandas as pd
    import numpy as np
    from itertools import takewhile
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType
    from pyspark.sql.functions import col
    
    # set seed for data
    np.random.seed(0)
    
    def date_seq_generator(start, delta):
        """
        Generator function for time observations.
    
        :param start: datetime start time
        :param delta: timedelta between observations
        :returns: next time observation
        """
        current = start - delta
        while True:
            current += delta
            yield current
    
    def to_datetime(datestring):
        """Convert datestring to correctly-formatted datetime object."""
        return datetime.datetime.strptime(datestring, '%Y-%m-%d %H:%M:%S')
    
    # choose an arbitrary time period
    start_time = to_datetime('2018-01-01 00:00:00')
    end_time = to_datetime('2018-01-02 00:00:00')
    
    # create the full time index between the start and end times
    initial_times = list(takewhile(lambda x: x <= end_time,
                date_seq_generator(start_time, datetime.timedelta(seconds=1))))
    
    # create dummy dataframe in Pandas
    pd_df = pd.DataFrame({'date': initial_times,
                          'foo': np.random.uniform(size=len(initial_times))})
    
    # emulate missing time indices
    pd_df = pd_df.sample(frac=.7)
    
    # save test data
    pd_df.to_csv('test_data.csv', index=False)
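
A quick check of the generated file (again a sketch, assuming the code above has just been run in the same session):

    # read the test data back and confirm that roughly 30% of the seconds are missing
    check = pd.read_csv('test_data.csv', parse_dates=['date'])
    print(len(initial_times))  # 86401 seconds in the full (inclusive) range
    print(len(check))          # roughly 70% of that after the subsampling above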
    

1 answer:

Answer 0 (score: 0)

Completing the dates on Spark using Scala:

    import org.joda.time.LocalDate
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._
    // in spark-shell, spark.implicits._ (needed for $ and toDF) is already imported

    def dateComplete(dataFrameDate0: DataFrame, colName: String): DataFrame = {
      // iterate day by day; note that takeWhile(_ isBefore end) excludes the end date itself
      def dayIterator(start: LocalDate, end: LocalDate) =
        Iterator.iterate(start)(_ plusDays 1) takeWhile (_ isBefore end)

      def dateSeries(date1: String, date2: String): Array[String] = {
        val fromDate = new LocalDate(date1)
        val toDate = new LocalDate(date2)
        dayIterator(fromDate, toDate).toArray.map(_.toString)
      }

      // observed min and max dates in the data
      val rangos = dataFrameDate0.agg(min($"invoicedate").as("minima_fecha"),
                                      max($"invoicedate").as("maxima_fecha"))

      // full day-by-day series as a one-column DataFrame; left-join the data onto it
      val serie_date = spark.sparkContext.parallelize(dateSeries(
          rangos.select("minima_fecha", "maxima_fecha").take(1)(0)(0).toString,
          rangos.select("minima_fecha", "maxima_fecha").take(1)(0)(1).toString)).toDF(colName)
      serie_date.join(dataFrameDate0, Seq(colName), "left")
    }

    val pivoteada = dateComplete(prod_group_day, "invoicedate")
      .groupBy("key_product")
      .pivot("invoicedate")
      .agg(sum("cantidad_prod").as("cantidad"))
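
For reference, the same idea can be written directly in PySpark. The sketch below is my own translation of the answer's logic, not part of the answer itself: it assumes a SparkSession named spark and a DataFrame prod_group_day whose invoicedate column is already a date type (both names taken from the answer), and unlike the Scala version it includes the maximum date in the generated series.

    # build the full day-by-day series between the observed min and max dates,
    # then left-join the original data onto it
    import datetime
    from pyspark.sql import functions as F

    def date_complete(df, col_name):
        lo, hi = df.agg(F.min(col_name), F.max(col_name)).first()
        days = [(lo + datetime.timedelta(days=i),) for i in range((hi - lo).days + 1)]
        full_index = spark.createDataFrame(days, [col_name])
        return full_index.join(df, [col_name], "left")

    completed = date_complete(prod_group_day, "invoicedate")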