Question

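My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to change 'df_test' to 'df_test2'. I have seen many examples that use the pandas module. Is there another way? Thanks in advance.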

df_test = sqlContext.createDataFrame([
(1, '14-Jul-15'),
(2, '14-Jun-15'),
(3, '11-Oct-15'),
], ('id', 'date'))

df_test2

id     day    month    year
1       14     Jul      15
2       14     Jun      15
3       11     Oct      15

Answer 1

Spark >= 2.2

You can skip unix_timestamp and the cast, and use to_date or to_timestamp directly:

from pyspark.sql.functions import to_date, to_timestamp

df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
## +---+----------+
## | id|      date|
## +---+----------+
## |  1|2015-07-14|
## |  2|2015-06-14|
## |  3|2015-10-11|
## +---+----------+


df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
## +---+-------------------+
## | id|               date|
## +---+-------------------+
## |  1|2015-07-14 00:00:00|
## |  2|2015-06-14 00:00:00|
## |  3|2015-10-11 00:00:00|
## +---+-------------------+

Then apply the other datetime functions shown below.

Spark < 2.2

It is not possible to derive multiple top-level columns in a single access. You can use structs or collection types with a UDF:

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row
from pyspark.sql.functions import udf, col

schema = StructType([
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("year", StringType(), True)
])

def split_date_(s):
    try:
        d, m, y = s.split("-")
        return d, m, y
    except:
        return None

split_date = udf(split_date_, schema)

transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()

## root
##  |-- id: long (nullable = true)
##  |-- date: struct (nullable = true)
##  |    |-- day: string (nullable = true)
##  |    |-- month: string (nullable = true)
##  |    |-- year: string (nullable = true)

But this is not only quite verbose in PySpark, it is also expensive.

For date-based transformations you can simply use built-in functions:

from pyspark.sql.functions import col, unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts",
        unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))

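Similarly, you could use regexp_extract to split the date string.

See also Derive multiple columns from a single column in a Spark DataFrame

Notes:

If you use a version not patched against SPARK-11724, this will require a correction after unix_timestamp(...) and before cast("timestamp").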

Answer 2

The solution here is to use the pyspark.sql.functions.split() function.

df = sqlContext.createDataFrame([
(1, '14-Jul-15'),
(2, '14-Jun-15'),
(3, '11-Oct-15'),
], ('id', 'date'))

import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['date'], '-')
df = df.withColumn('day', split_col.getItem(0))
df = df.withColumn('month', split_col.getItem(1))
df = df.withColumn('year', split_col.getItem(2))
df = df.drop("date")

pyspark: split a column into multiple columns without pandas
