I am trying to merge a row into an existing DataFrame. The existing DataFrame has this schema:
StructType(List(StructField(date,TimestampType,true),
StructField(time,StringType,true),
StructField(size,IntegerType,true),
StructField(r_version,StringType,true),
StructField(r_arch,StringType,true),
StructField(r_os,StringType,true),
StructField(package,StringType,true),
StructField(version,StringType,true),
StructField(country,StringType,true),
StructField(ip_id,IntegerType,true)))
I am trying to add a row to it, but because the date field is of TimestampType I run into an error. My code:
from pyspark.sql.functions import lit
d = lit('2015-12-12 00:00:00').cast("timestamp")
from pyspark.sql import Row
new = [ Row(d,'13:42:10',100,'3.2.3','i377','NA','','','DE','900') ]
df2 = spark.createDataFrame(new,s)
where s holds the schema shown above.
The error I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Spark\python\pyspark\sql\session.py", line 522, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "C:\Spark\python\pyspark\sql\session.py", line 383, in _createFromLocal
data = list(data)
File "C:\Spark\python\pyspark\sql\session.py", line 505, in prepare
verify_func(obj, schema)
File "C:\Spark\python\pyspark\sql\types.py", line 1360, in _verify_type
_verify_type(v, f.dataType, f.nullable)
File "C:\Spark\python\pyspark\sql\types.py", line 1324, in _verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj,
type(obj)))
TypeError: TimestampType can not accept object Column<CAST(2015-12-12
00:00:00 AS TIMESTAMP)> in type <class 'pyspark.sql.column.Column'>
Answer 0 (score: 0)
There are two options:
1. If you want the timestamp specified up front in the schema, parse the "date" value with dateparser.
2. Define the "date" column as StringType in the schema and cast it to a timestamp in a later select.
Code for option 1:
import time, datetime, dateparser
from pyspark.sql.functions import lit, from_unixtime
from pyspark.sql import Row, Column
from pyspark.sql.types import (StructType, StructField,
                               TimestampType, StringType, IntegerType)
#d = "2015-12-12 00:00:00"
d0 = dateparser.parse("2015-12-12 00:00:00")   # plain datetime.datetime, not a Column
t0 = int(time.mktime(d0.timetuple()))          # epoch seconds, usable with from_unixtime
print("d0, t0:", d0, t0)
schema= StructType([StructField("date",TimestampType(),True),
StructField("time",StringType(),True),
StructField("size",IntegerType(),True),
StructField("r_version",StringType(),True),
StructField("r_arch",StringType(),True),
StructField("r_os",StringType(),True),
StructField("rpackage",StringType(),True),
StructField("version",StringType(),True),
StructField("country",StringType(),True),
StructField("ip_id",IntegerType(),True)])
new = [ Row(d0,'13:42:10',100,'3.2.3','i377','NA','','','DE',900) ]
df2 = spark.createDataFrame(new,schema)
df2.show()
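The root cause of the original error is that lit(...).cast("timestamp") produces a Column expression, while Row fields must hold plain Python values; Spark's TimestampType maps to datetime.datetime. As a minimal sketch (my own simplification, not part of the original answer), the standard library alone is enough here, with no dateparser dependency:

```python
from datetime import datetime

# TimestampType accepts a plain datetime.datetime, not a Column expression.
d0 = datetime.strptime("2015-12-12 00:00:00", "%Y-%m-%d %H:%M:%S")
print(d0)  # 2015-12-12 00:00:00
```

This d0 can be placed directly into the Row exactly as in the code above.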
Code for option 2:
# have "date" as StringType instead of TimestampType when creating the
# dataframe
df2.select(df2.date.cast("timestamp").alias("date_col_ts")).show()