Question

I have the following sample DataFrame in PySpark. The column is currently of date datatype.

scheduled_date_plus_one
12/2/2018
12/7/2018

I want to reformat the date and add a 2 AM timestamp to it, using 24-hour time. Below is the DataFrame column output I want:

scheduled_date_plus_one
2018-12-02T02:00:00Z
2018-12-07T02:00:00Z

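How can I achieve this? I know how to do it in Python Pandas, but I am not familiar with PySpark.

I know the column I want will be of string datatype, since the values contain "T" and "Z". That's fine... I think I already know how to convert the string datatype to a timestamp, so I'm all set there.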

Answer 1

Let's create this PySpark DataFrame for you. You have to import to_date from the pyspark.sql.functions module.

Step 0: Import these 4 functions:

    from pyspark.sql.functions import to_date, date_format, concat, lit

Step 1:

    values = [('12/2/2018',), ('12/7/2018',)]
    # sqlContext is assumed to be predefined, as it is in the PySpark shell
    df = sqlContext.createDataFrame(values, ['scheduled_date_plus_one'])
    # Parse the string into a date column
    df = df.withColumn('scheduled_date_plus_one', to_date('scheduled_date_plus_one', 'MM/dd/yyyy'))

    df.printSchema()
    root
     |-- scheduled_date_plus_one: date (nullable = true)

    df.show()
    +-----------------------+
    |scheduled_date_plus_one|
    +-----------------------+
    |             2018-12-02|
    |             2018-12-07|
    +-----------------------+

As the .printSchema() output shows, the column is in date format. So, as a first step, we have created the required DataFrame.

Step 2: Convert scheduled_date_plus_one from date format to string format, so that we can concatenate T02:00:00Z to it. date_format converts a date to a string in the desired format. We chose yyyy-MM-dd.

    # Render the date column as a yyyy-MM-dd string
    df = df.withColumn('scheduled_date_plus_one', date_format('scheduled_date_plus_one', "yyyy-MM-dd"))

    df.printSchema()
    root
     |-- scheduled_date_plus_one: string (nullable = true)

    df.show()
    +-----------------------+
    |scheduled_date_plus_one|
    +-----------------------+
    |             2018-12-02|
    |             2018-12-07|
    +-----------------------+

The .printSchema() output above shows that scheduled_date_plus_one has been converted to string format, so now we can do the concatenation part.

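Step 3: Concatenation. For this we use the concat function. Note: you must wrap T02:00:00Z in lit(), because we are not concatenating two columns. lit() turns the literal into a column expression; a bare string passed to concat would be read as a column name.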

PySpark: Add a timestamp to a "date" column and reformat the entire column to "timestamp" datatype
