Resampling in Spark

Date: 2015-12-29 06:18:40

Tags: python pandas apache-spark histogram pyspark

Suppose I have a DataFrame (df) in Pandas, or an RDD in Spark, with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pandas I can easily create binned histograms with different bin lengths. For example, to bin over 1-hour intervals, I would do the following:

df = df[['timestamp', 'data']].set_index('timestamp')
df.resample('1H', how=sum).dropna()
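
Note that resample requires a datetime-like index, so the float timestamps have to be converted first. A minimal sketch assuming the floats are Unix epoch seconds (newer pandas versions use the .resample(...).sum() form instead of how=sum):

import pandas as pd

df = pd.DataFrame({'timestamp': [12345.0, 12346.0], 'data': [10, 12]})
# resample needs a DatetimeIndex; interpret the floats as epoch seconds
df.index = pd.to_datetime(df['timestamp'], unit='s')
hourly = df['data'].resample('1H').sum().dropna()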

Moving from a Spark RDD to a Pandas DataFrame is rather expensive for me (given the dataset), so I would prefer to stay in the Spark domain as much as possible.

Is there a way to perform the equivalent operation on Spark RDDs or DataFrames?

2 Answers:

Answer 0 (score: 2):

In this particular case, all you need is the Unix timestamp and basic arithmetic:

from pyspark.sql.functions import floor, unix_timestamp

def resample_to_minute(c, interval=1):
    # Round the epoch seconds down to the nearest interval boundary
    t = 60 * interval
    return (floor(c / t) * t).cast("timestamp")

def resample_to_hour(c, interval=1):
    return resample_to_minute(c, 60 * interval)

df = sc.parallelize([
    ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
    ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
    ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
    ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
    ("2000-01-01 00:08:00", 8)
]).toDF(["timestamp", "data"])

(df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
    .sum().orderBy("ts").show(3, False))

## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|3        |
## |2000-01-01 00:03:00.0|12       |
## |2000-01-01 00:06:00.0|21       |
## +---------------------+---------+

(df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
    .sum().orderBy("ts").show(3, False))
## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|36       |
## +---------------------+---------+
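
On Spark 2.0 and later, the built-in pyspark.sql.functions.window function can express the same bucketing directly; a minimal sketch (the cast is needed here because the example timestamps are strings):

from pyspark.sql.functions import col, sum as sum_, window

(df.groupBy(window(col("timestamp").cast("timestamp"), "3 minutes"))
    .agg(sum_("data"))
    .orderBy("window")
    .show(truncate=False))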

The example data is taken from the pandas.DataFrame.resample documentation.

For the general case, see Making histogram with Spark DataFrame column.

Answer 1 (score: 0):

Here is an answer using RDDs rather than DataFrames:

# Generating some data to test with
import random
from datetime import datetime

startTS = 12345.0
array = [(startTS + 60 * k, random.randrange(10, 20)) for k in range(150)]

# Initializing an RDD
rdd = sc.parallelize(array)

# First map the timestamps to datetime objects so that datetime.replace
# can be used to round the times
formattedRDD = (rdd
                .map(lambda pair: (datetime.fromtimestamp(int(pair[0])), pair[1]))
                .cache())

# Zeroing out the minute and second fields of a datetime object is
# exactly like rounding down to the hour; reduceByKey then aggregates the bins
# (mapping each record to 1 counts the records per hour)
hourlyRDD = (formattedRDD
             .map(lambda pair: (pair[0].replace(minute=0, second=0), 1))
             .reduceByKey(lambda a, b: a + b))

hourlyHisto = hourlyRDD.collect()
print(hourlyHisto)
> [(datetime.datetime(1970, 1, 1, 4, 0), 60), (datetime.datetime(1970, 1, 1, 5, 0), 55), (datetime.datetime(1970, 1, 1, 3, 0), 35)]

For daily aggregation, you can use time.date() instead of time.replace(...). Also, for hourly bins that start from a non-round datetime, you can shift the original times by a timedelta toward the nearest round hour before binning, as sketched below.
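
A minimal sketch of both variants, reusing formattedRDD from above (the anchor offset is hypothetical, not taken from the original answer):

from datetime import timedelta

# Daily bins: key by the calendar date instead of the zeroed-out hour
dailyRDD = (formattedRDD
            .map(lambda pair: (pair[0].date(), 1))
            .reduceByKey(lambda a, b: a + b))

# Hourly bins anchored at hh:25:45 instead of hh:00:00 (hypothetical offset):
# subtract the offset, round down to the hour, then add the offset back
offset = timedelta(minutes=25, seconds=45)
anchoredRDD = (formattedRDD
               .map(lambda pair: ((pair[0] - offset).replace(minute=0, second=0) + offset, 1))
               .reduceByKey(lambda a, b: a + b))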