For example, given the following JSON (in a file named 'json'):
{"myTime": "2016-10-26 18:19:15"}
and the following Python script:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('simpleTest')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print sc.version
json_file = 'json'
df = sqlContext.read.json(json_file,timestampFormat='yyyy-MM-dd HH:mm:ss')
df.printSchema()
The output is:
2.0.2
root
|-- myTime: string (nullable = true)
I expected the schema to be inferred as a timestamp. What am I missing?
Answer 0 (score: 0)
You need to define the schema explicitly:
from pyspark.sql.types import StructType, StructField, TimestampType
schema = StructType([StructField("myTime", TimestampType(), True)])
df = spark.read.json(json_file, schema=schema, timestampFormat="yyyy-MM-dd HH:mm:ss")
This outputs:
>>> df.collect()
[Row(myTime=datetime.datetime(2016, 10, 26, 18, 19, 15))]
>>> df.printSchema()
root
|-- myTime: timestamp (nullable = true)
>>>
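As a side note, the `timestampFormat` pattern above uses Java `SimpleDateFormat` letters (`yyyy-MM-dd HH:mm:ss`), which are not the same as Python's `strptime` directives. A minimal plain-Python sketch of the equivalent parse, to illustrate what Spark produces for this value (the `%Y-%m-%d %H:%M:%S` format string is the Python counterpart of the Spark pattern, not something Spark itself uses):

```python
from datetime import datetime

# Java SimpleDateFormat "yyyy-MM-dd HH:mm:ss" corresponds to
# Python strptime "%Y-%m-%d %H:%M:%S"
parsed = datetime.strptime("2016-10-26 18:19:15", "%Y-%m-%d %H:%M:%S")
print(parsed)  # 2016-10-26 18:19:15
```

This matches the `datetime.datetime(2016, 10, 26, 18, 19, 15)` value seen in the `df.collect()` output above.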
Answer 1 (score: 0)
In addition to Dat Tran's solution, you can also apply `cast` directly to the DataFrame column after reading the file.
# example
from pyspark.sql import Row
json = [Row(**{"myTime": "2016-10-26 18:19:15"})]
df = spark.sparkContext.parallelize(json).toDF()
# using cast to 'timestamp' format
df_time = df.select(df['myTime'].cast('timestamp'))
df_time.printSchema()
root
|-- myTime: timestamp (nullable = true)
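One thing to keep in mind with the `cast('timestamp')` approach: values that fail to parse become null rather than raising an error. A rough plain-Python sketch of that null-on-failure behavior (an illustration of the semantics, not Spark's actual implementation, and assuming a single fixed format for simplicity):

```python
from datetime import datetime

def cast_to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Sketch of cast('timestamp') semantics: unparseable input yields None (null)."""
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        return None

print(cast_to_timestamp("2016-10-26 18:19:15"))  # 2016-10-26 18:19:15
print(cast_to_timestamp("not a time"))           # None
```

This silent-null behavior is convenient for messy data, but it also means a wrong format can quietly turn a whole column into nulls, so it is worth checking the result after casting.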