I have a pandas dataframe whose timestamp column is of type pandas.tslib.Timestamp. I looked through the pyspark source for createDataFrame (link to source), and it seems the data is converted from a numpy record array to a list:
data = [r.tolist() for r in data.to_records(index=False)]
However, the timestamp type is converted to a list of longs along the way:
> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
> df
                           0
0 2017-07-25 11:53:29.353923
1 2017-07-25 11:53:30.353923
2 2017-07-25 11:53:31.353923
3 2017-07-25 11:53:32.353923
4 2017-07-25 11:53:33.353923
> df.to_records(index=False).tolist()
[(1500983799614193000L,), (1500983800614193000L,), (1500983801614193000L,), (1500983802614193000L,), (1500983803614193000L,)]
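These longs appear to be nanoseconds since the Unix epoch (pandas stores datetime64[ns] values as int64 internally), which you can check by feeding one back to pd.Timestamp, which interprets a bare integer as nanoseconds:

> pd.Timestamp(1500983799614193000)
Timestamp('2017-07-25 11:56:39.614193')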
Now, if I pass such a list to an RDD, do some operations (without touching the timestamp column), and then call
> spark.createDataFrame(rdd, schema)  # with a schema declaring that column as TimestampType
I get:
TypeError: TimestampType can not accept object 1465197332112000000L in type <type 'long'>
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What should I do (before converting the list to an RDD) to preserve the datetime type?
Edit 1
Some approaches I know of would involve processing after the dataframe is created:

1. Add timezone information to the datetime objects in pandas. This seems unnecessary, though, and may cause errors depending on your own timezone.
2. Use the datetime library to convert the longs back to timestamps. Assuming tstampl is the input: tstamp = datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000) (a sketch of this follows the list below).
3. As described in Suresh's answer below.
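A minimal sketch of approach 2 (the function name here is illustrative), applied to the rows before createDataFrame:

from datetime import datetime, timedelta

def long_to_datetime(tstampl):
    # tstampl: nanoseconds since the Unix epoch, as produced by to_records().tolist();
    # timedelta takes microseconds, hence the division by 1000
    return datetime(1970, 1, 1) + timedelta(microseconds=tstampl // 1000)

# e.g. restore the first field of every row before building the dataframe:
# rdd = rdd.map(lambda row: (long_to_datetime(row[0]),) + tuple(row[1:]))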
However, I am looking for a simpler approach that takes care of all the processing before the dataframe is created.
Answer 0 (score: 1)
I tried converting the timestamp column to string type, then applying tolist() on the pandas Series. Use the list to build a spark dataframe and convert it back to timestamp there.
>>> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
>>> df
                        0
0 2017-07-25 21:51:53.963
1 2017-07-25 21:51:54.963
2 2017-07-25 21:51:55.963
3 2017-07-25 21:51:56.963
4 2017-07-25 21:51:57.963
>>> df1 = df[0].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
>>> type(df1)
<class 'pandas.core.series.Series'>
>>> df1.tolist()
['2017-07-25 21:51:53', '2017-07-25 21:51:54', '2017-07-25 21:51:55', '2017-07-25 21:51:56', '2017-07-25 21:51:57']
>>> from pyspark.sql.types import StringType, TimestampType
>>> sdf = spark.createDataFrame(df1.tolist(),StringType())
>>> sdf.printSchema()
root
|-- value: string (nullable = true)
>>> sdf = sdf.select(sdf['value'].cast('timestamp'))
>>> sdf.printSchema()
root
|-- value: timestamp (nullable = true)
>>> sdf.show(5,False)
+---------------------+
|value |
+---------------------+
|2017-07-25 21:51:53.0|
|2017-07-25 21:51:54.0|
|2017-07-25 21:51:55.0|
|2017-07-25 21:51:56.0|
|2017-07-25 21:51:57.0|
+---------------------+
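Note that strftime('%Y-%m-%d %H:%M:%S') above drops the sub-second part (the .963 milliseconds vanish from the list). If you need to keep it, a %f variant should work as well, since the string-to-timestamp cast accepts fractional seconds (untested sketch):

>>> df1 = df[0].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S.%f'))
>>> df1.tolist()[0]
'2017-07-25 21:51:53.963000'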