I'm using Python / PySpark in a Jupyter Notebook, and I'm trying to figure out the following:
I have a dataframe like this:
MainDate   Date1      Date2      Date3      Date4
2015-10-25 2015-09-25 2015-10-25 2015-11-25 2015-12-25
2012-07-16 2012-04-16 2012-05-16 2012-06-16 2012-07-16
2005-03-14 2005-07-14 2005-08-14 2005-09-14 2005-10-14
I need to compare MainDate with each of Date1-Date4. If MainDate == Date#, then create a new column REAL = "Date#"; if there is no match, REAL = "None". All columns are in date format. The real dataframe has Date1 through Date72, and if there is a match at all, there will be at most one.
Final result:
MainDate   Date1      Date2      Date3      Date4      REAL
2015-10-25 2015-09-25 2015-10-25 2015-11-25 2015-12-25 Date2
2012-07-16 2012-04-16 2012-05-16 2012-06-16 2012-07-16 Date4
2005-03-14 2005-07-14 2005-08-14 2005-09-14 2005-10-14 None
Thanks in advance.
Answer 0 (score: 2)
I would use coalesce:
from pyspark.sql.functions import col, when, coalesce, lit
df = spark.createDataFrame([
    ("2015-10-25", "2015-09-25", "2015-10-25", "2015-11-25", "2015-12-25"),
    ("2012-07-16", "2012-04-16", "2012-05-16", "2012-06-16", "2012-07-16"),
    ("2005-03-14", "2005-07-14", "2005-08-14", "2005-09-14", "2005-10-14")],
    ("MainDate", "Date1", "Date2", "Date3", "Date4")
)

df.withColumn("REAL",
    coalesce(*[when(col(c) == col("MainDate"), lit(c)) for c in df.columns[1:]])
).show()
+----------+----------+----------+----------+----------+-----+
| MainDate| Date1| Date2| Date3| Date4| REAL|
+----------+----------+----------+----------+----------+-----+
|2015-10-25|2015-09-25|2015-10-25|2015-11-25|2015-12-25|Date2|
|2012-07-16|2012-04-16|2012-05-16|2012-06-16|2012-07-16|Date4|
|2005-03-14|2005-07-14|2005-08-14|2005-09-14|2005-10-14| null|
+----------+----------+----------+----------+----------+-----+
where when(col(c) == col("MainDate"), lit(c)) returns the column name (lit(c)) if there is a match, and NULL otherwise.
This should be much faster than a udf or converting to an RDD.
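For intuition (this is a plain-Python sketch, not part of the Spark answer above), the per-row effect of coalesce over those when expressions is: each when yields the column name on a match and NULL otherwise, and coalesce keeps the first non-NULL value.

```python
# Plain-Python sketch of the per-row logic behind coalesce(*[when(...), ...]):
# each when(col(c) == col("MainDate"), lit(c)) contributes c on a match and
# None otherwise; coalesce takes the first non-None contribution.
def first_match(row, date_cols):
    candidates = [c if row[c] == row["MainDate"] else None for c in date_cols]
    return next((c for c in candidates if c is not None), None)

row = {"MainDate": "2015-10-25", "Date1": "2015-09-25", "Date2": "2015-10-25",
       "Date3": "2015-11-25", "Date4": "2015-12-25"}
print(first_match(row, ["Date1", "Date2", "Date3", "Date4"]))  # Date2
```

Since at most one column can match, "first non-NULL" and "the match" coincide here.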
Answer 1 (score: 1)
You can convert the dataframe to an RDD and append a new field to each row by checking which Date column matches MainDate:
from pyspark.sql import Row
from pyspark.sql.types import StringType

df = spark.read.option("header", True).option("inferSchema", True).csv("test.csv")

# get the list of columns you want to compare with MainDate
dates = [col for col in df.columns if col.startswith('Date')]

# for each row, loop through the Date columns and find the match; if nothing matches, return None
rdd = df.rdd.map(lambda row: row + Row(REAL=next((col for col in dates if row[col] == row['MainDate']), None)))

# recreate the data frame from the RDD, with REAL appended to the schema
spark.createDataFrame(rdd, df.schema.add("REAL", StringType(), True)).show()
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
| MainDate| Date1| Date2| Date3| Date4| REAL|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
|2015-10-25 00:00:...|2015-09-25 00:00:...|2015-10-25 00:00:...|2015-11-25 00:00:...|2015-12-25 00:00:...|Date2|
|2012-07-16 00:00:...|2012-04-16 00:00:...|2012-05-16 00:00:...|2012-06-16 00:00:...|2012-07-16 00:00:...|Date4|
|2005-03-14 00:00:...|2005-07-14 00:00:...|2005-08-14 00:00:...|2005-09-14 00:00:...|2005-10-14 00:00:...| null|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
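A side note on the output above: inferSchema parsed the date columns as timestamps, which is why they render as "2015-10-25 00:00:...". If plain dates are wanted, one option (a sketch, not part of the original answer) is to cast every column first, e.g. `df.select([col(c).cast("date").alias(c) for c in df.columns])`. In plain Python terms, such a cast just drops the time-of-day component:

```python
from datetime import datetime

# What inferSchema produced for "2015-10-25": a timestamp at midnight.
ts = datetime(2015, 10, 25, 0, 0, 0)

# What a cast to DateType would keep: only the calendar date.
d = ts.date()
print(d.isoformat())  # 2015-10-25
```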