How does Spark handle timestamp types during Pandas dataframe conversion?

Asked: 2017-07-25 16:03:18

Tags: python datetime numpy apache-spark pyspark

I have a pandas dataframe whose timestamp column is of type pandas.tslib.Timestamp. I looked through the pyspark source code for createDataFrame (link to source), and it seems the data is converted to a numpy record array and then to a list:

data = [r.tolist() for r in data.to_records(index=False)]

However, the timestamp type gets converted to a list of longs in the process:

> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
> df
                           0
0 2017-07-25 11:53:29.353923
1 2017-07-25 11:53:30.353923
2 2017-07-25 11:53:31.353923
3 2017-07-25 11:53:32.353923
4 2017-07-25 11:53:33.353923
> df.to_records(index=False).tolist()
[(1500983799614193000L,), (1500983800614193000L,), (1500983801614193000L,), (1500983802614193000L,), (1500983803614193000L,)]

Now, if I pass such a list to an RDD, perform a few operations (without touching the timestamp column), and then call

> spark.createDataFrame(rdd,schema) # with schema mentioning that column as TimestampType
TypeError: TimestampType can not accept object 1465197332112000000L in type <type 'long'>
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
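
For context: TimestampType accepts datetime.datetime objects, not longs, which is why the rows above are rejected. A minimal check, assuming the usual spark session variable:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

ts_schema = StructType([StructField('ts', TimestampType(), True)])
# a plain datetime object passes the TimestampType verifier
spark.createDataFrame([(datetime(2017, 7, 25, 11, 53, 29),)], ts_schema).show()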

What should I do (before converting the list to an RDD) to preserve the datetime type?

Edit 1

Some approaches I know of would involve processing after the dataframe is created:

  1. Add timezone information to the datetime objects in pandas. This seems unnecessary, though, and may lead to errors depending on your own timezone.

  2. Use the datetime library to convert the longs back to timestamps. Assuming tstampl is the input: tstamp = datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000) (see the sketch after this list).

  3. Convert the datetimes to strings on the pandas dataframe side, then convert back to timestamps on the Spark dataframe side, as described in Suresh's answer below.

However, I am looking for a simpler approach that handles everything before the dataframe is created.
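
A minimal sketch of approach 2, applied to the RDD just before createDataFrame (the names rdd and schema, and the timestamp sitting in column 0, are assumptions for illustration):

from datetime import datetime, timedelta

def long_to_ts(tstampl):
    # the longs are nanoseconds since the epoch; timedelta wants microseconds
    return datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000)

# assumption: the timestamp long is the first field of each row tuple
fixed = rdd.map(lambda row: (long_to_ts(row[0]),) + tuple(row[1:]))
sdf = spark.createDataFrame(fixed, schema)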

1 Answer:

Answer 0 (score: 1):

I tried converting the timestamp column to string type and applying tolist() on the pandas series, then used the list to build the spark dataframe and cast back to timestamp there.

>>> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
>>> df
                    0
0 2017-07-25 21:51:53.963
1 2017-07-25 21:51:54.963
2 2017-07-25 21:51:55.963
3 2017-07-25 21:51:56.963
4 2017-07-25 21:51:57.963

>>> df1 = df[0].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
>>> type(df1)
<class 'pandas.core.series.Series'>
>>> df1.tolist()
['2017-07-25 21:51:53', '2017-07-25 21:51:54', '2017-07-25 21:51:55', '2017-07-25 21:51:56', '2017-07-25 21:51:57']

>>> from pyspark.sql.types import StringType, TimestampType
>>> sdf = spark.createDataFrame(df1.tolist(), StringType())
>>> sdf.printSchema()
root
 |-- value: string (nullable = true)
>>> sdf = sdf.select(sdf['value'].cast('timestamp'))
>>> sdf.printSchema()
root
 |-- value: timestamp (nullable = true)

>>> sdf.show(5,False)
+---------------------+
|value                |
+---------------------+
|2017-07-25 21:51:53.0|
|2017-07-25 21:51:54.0|
|2017-07-25 21:51:55.0|
|2017-07-25 21:51:56.0|
|2017-07-25 21:51:57.0|
+---------------------+
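
Note that the '%Y-%m-%d %H:%M:%S' format above drops the sub-second part (the pandas values end in .963 while the Spark values show .0). A small variation that should keep microseconds, assuming Spark's timestamp cast parses fractional seconds:

>>> # '%f' appends microseconds; the cast to timestamp parses the fractional part
>>> df1 = df[0].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S.%f'))
>>> sdf = spark.createDataFrame(df1.tolist(), StringType())
>>> sdf.select(sdf['value'].cast('timestamp')).show(5, False)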