I'm looking for a way to add a new column to a Spark DataFrame from a list of strings, ideally in one simple line.
Given:
rdd = sc.parallelize([((u'2016-10-19', u'2016-293'), 40020),
                      ((u'2016-10-19', u'2016-293'), 143938),
                      ((u'2016-10-19', u'2016-293'), 135891225.0)])
here is the code that structures my rdd and gives me a DataFrame:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    # unpack ((OrderDate, OrderPeriod), metrics) and cast the metric to str
    ((OrderDate, OrderPeriod), metrics) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics

(rdd
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
 .show())
The result:
+----------+-----------+-----------+
| OrderDate|OrderPeriod|MetricValue|
+----------+-----------+-----------+
|2016-10-19| 2016-293| 40020|
|2016-10-19| 2016-293| 143938|
|2016-10-19| 2016-293|135891225.0|
+----------+-----------+-----------+
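(Note that MetricValue comes out as a string column because of the str() cast; printing the schema instead of the rows confirms it — just a sanity check:)

(rdd
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
 .printSchema())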
Now I want to add a new column giving the name of each metric. Here is what I did:
def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    # same as above, but zipWithIndex wraps each element as (element, index)
    (((OrderDate, OrderPeriod), metrics), index) = line
    metrics = str(metrics)
    return OrderDate, OrderPeriod, metrics, index

df1 = (rdd
       .zipWithIndex()
       .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
       .toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'index']))
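As I understand it, zipWithIndex numbers the rows 0, 1 and 2 in the RDD's order, which is what the lookup table below relies on; a quick check:

df1.select('index', 'MetricValue').show()
# expect index 0, 1, 2 paired with 40020, 143938 and 135891225.0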
Then:
from pyspark.sql.types import StructType, StructField, StringType

df2 = sqlContext.createDataFrame(
    sc.parallelize([('0', 'UsersNb'),
                    ('1', 'VideosNb'),
                    ('2', 'VideosDuration')]),
    StructType([StructField('index', StringType()),
                StructField('MetricName', StringType())]))
df2.show()
+-----+--------------+
|index| MetricName|
+-----+--------------+
| 0| UsersNb|
| 1| VideosNb|
| 2|VideosDuration|
+-----+--------------+
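As an aside, the same lookup DataFrame can be built more compactly, since createDataFrame also accepts a plain Python list of tuples together with a list of column names (an equivalent sketch):

df2 = sqlContext.createDataFrame(
    [('0', 'UsersNb'), ('1', 'VideosNb'), ('2', 'VideosDuration')],
    ['index', 'MetricName'])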
And finally:
(df1
 .join(df2, df1.index == df2.index)
 .drop(df2.index)
 .select('index', 'OrderDate', 'OrderPeriod', 'MetricName', 'MetricValue')
 .show())
+-----+----------+-----------+--------------+-----------+
|index| OrderDate|OrderPeriod| MetricName|MetricValue|
+-----+----------+-----------+--------------+-----------+
| 0|2016-10-19| 2016-293| VideosNb| 143938|
| 1|2016-10-19| 2016-293| UsersNb| 40020|
| 2|2016-10-19| 2016-293|VideosDuration|135891225.0|
+-----+----------+-----------+--------------+-----------+
This is my expected output, but this approach takes quite a long time. I'd like to do it in one or two lines, for example with something like the lit function:
from pyspark.sql.functions import lit
df1.withColumn('MetricName', lit('my_string'))
But of course I need to insert three different strings: 'VideosNb', 'UsersNb' and 'VideosDuration'.
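The closest I've come up with is to skip the join entirely and do the lookup inside the map, indexing a local Python list with the position produced by zipWithIndex — a sketch that assumes the rows always arrive in this fixed order:

names = ['UsersNb', 'VideosNb', 'VideosDuration']

df = (rdd
      .zipWithIndex()
      .map(lambda row: (row[0][0][0],     # OrderDate
                        row[0][0][1],     # OrderPeriod
                        str(row[0][1]),   # MetricValue
                        names[row[1]]))   # MetricName looked up by position
      .toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'MetricName']))

But that still doesn't feel like the clean lit-style call I'm hoping exists.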
Any ideas? Thank you very much!