Appending a row to a DataFrame

Time: 2017-07-12 01:41:05

Tags: python apache-spark pyspark

I am trying to append a row to an existing DataFrame. The existing DataFrame has this schema:

StructType(List(StructField(date,TimestampType,true),
               StructField(time,StringType,true),
               StructField(size,IntegerType,true),
               StructField(r_version,StringType,true),
               StructField(r_arch,StringType,true),
               StructField(r_os,StringType,true),
               StructField(package,StringType,true),
               StructField(version,StringType,true),
               StructField(country,StringType,true),
               StructField(ip_id,IntegerType,true)))

I am trying to add a row to it, but because my date field is of TimestampType I run into an error:
from pyspark.sql.functions import lit
d = lit('2015-12-12 00:00:00').cast("timestamp")
from pyspark.sql import Row
new = [ Row(d,'13:42:10',100,'3.2.3','i377','NA','','','DE','900') ]
df2 = spark.createDataFrame(new,s) 

Here `s` holds the schema shown above.

The error I receive:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Spark\python\pyspark\sql\session.py", line 522, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "C:\Spark\python\pyspark\sql\session.py", line 383, in _createFromLocal
data = list(data)
File "C:\Spark\python\pyspark\sql\session.py", line 505, in prepare
verify_func(obj, schema)
File "C:\Spark\python\pyspark\sql\types.py", line 1360, in _verify_type
_verify_type(v, f.dataType, f.nullable)
File "C:\Spark\python\pyspark\sql\types.py", line 1324, in _verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, 
type(obj)))
TypeError: TimestampType can not accept object Column<CAST(2015-12-12 
00:00:00 AS TIMESTAMP)> in type <class 'pyspark.sql.column.Column'>

1 Answer:

Answer 0 (score: 0)

There are two options:

1. If the schema should keep the timestamp type, parse the "date" value with dateparser.
2. Define the "date" column as StringType in the schema, and convert it with cast() during a select.
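The root cause of the traceback is that `lit('…').cast("timestamp")` produces a `pyspark.sql.Column` expression, while the fields of a `Row` must be plain Python values; for `TimestampType`, that means a `datetime.datetime`. As an illustration, the standard library alone can produce one (an alternative to the `dateparser` call used in option 1, since the input format here is fixed):

```python
from datetime import datetime

# Parse the literal into a plain datetime.datetime -- a value that
# TimestampType accepts inside a Row, unlike a Column expression.
d0 = datetime.strptime("2015-12-12 00:00:00", "%Y-%m-%d %H:%M:%S")
print(d0)  # 2015-12-12 00:00:00
```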

Code for option 1:

import time
import dateparser  # third-party: pip install dateparser
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField,
                               TimestampType, StringType, IntegerType)

# Parse the literal into a plain datetime.datetime -- TimestampType
# accepts that, unlike the Column produced by lit(...).cast("timestamp").
d0 = dateparser.parse("2015-12-12 00:00:00")
t0 = int(time.mktime(d0.timetuple()))  # epoch seconds, for reference
print("dt0, ts0", d0, t0)

schema = StructType([StructField("date", TimestampType(), True),
                     StructField("time", StringType(), True),
                     StructField("size", IntegerType(), True),
                     StructField("r_version", StringType(), True),
                     StructField("r_arch", StringType(), True),
                     StructField("r_os", StringType(), True),
                     StructField("package", StringType(), True),
                     StructField("version", StringType(), True),
                     StructField("country", StringType(), True),
                     StructField("ip_id", IntegerType(), True)])

new = [Row(d0, '13:42:10', 100, '3.2.3', 'i377', 'NA', '', '', 'DE', 900)]
df2 = spark.createDataFrame(new, schema)
df2.show()

Code for option 2:

# have "date" as StringType instead of TimestampType when creating the 
# dataframe
df2.select(df2.date.cast("timestamp").alias("date_col_ts")).show()