PySpark: add a column to a DataFrame from a list of strings

Date: 2016-11-03 10:11:22

Tags: python-2.7 apache-spark dataframe pyspark spark-dataframe

I'm looking for a way to add a new column to a Spark DataFrame from a list of strings, ideally in a single, simple line.

Given:

rdd = sc.parallelize([((u'2016-10-19', u'2016-293'), 40020), 
                      ((u'2016-10-19', u'2016-293'), 143938), 
                      ((u'2016-10-19', u'2016-293'), 135891225.0)
                     ])

Here is the code that structures my RDD and builds the DataFrame:

def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    # Unpack ((OrderDate, OrderPeriod), metrics) and stringify the metric
    # value so every row fits a single string column.
    ((OrderDate, OrderPeriod), metrics) = line
    metrics = str(metrics)

    return OrderDate, OrderPeriod, metrics


(rdd
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue'])
 .show())

Result:

+----------+-----------+-----------+
| OrderDate|OrderPeriod|MetricValue|
+----------+-----------+-----------+
|2016-10-19|   2016-293|      40020|
|2016-10-19|   2016-293|     143938|
|2016-10-19|   2016-293|135891225.0|
+----------+-----------+-----------+

I'd like to add a new column giving the name of each metric. Here is what I did:

def structurate_CohortPeriod_metrics_by_OrderPeriod(line):
    # zipWithIndex wraps each record, so the shape to unpack is
    # (((OrderDate, OrderPeriod), metrics), index).
    (((OrderDate, OrderPeriod), metrics), index) = line
    metrics = str(metrics)

    return OrderDate, OrderPeriod, metrics, index

df1 = (rdd
 .zipWithIndex()
 .map(structurate_CohortPeriod_metrics_by_OrderPeriod)
 .toDF(['OrderDate', 'OrderPeriod', 'MetricValue', 'index']))
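
For reference, zipWithIndex pairs each record with its position, so the mapper above unpacks tuples of the following shape (a quick illustrative check on the same data):

rdd.zipWithIndex().take(3)

# [(((u'2016-10-19', u'2016-293'), 40020), 0),
#  (((u'2016-10-19', u'2016-293'), 143938), 1),
#  (((u'2016-10-19', u'2016-293'), 135891225.0), 2)]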

Then:

from pyspark.sql.types import StructType, StructField, StringType

df2 = sqlContext.createDataFrame(sc.parallelize([('0', 'UsersNb'), 
                                                 ('1', 'VideosNb'), 
                                                 ('2', 'VideosDuration')]), 
                                 StructType([StructField('index', StringType()), 
                                             StructField('MetricName', StringType())]))

df2.show()

+-----+--------------+
|index|    MetricName|
+-----+--------------+
|    0|       UsersNb|
|    1|      VideosNb|
|    2|VideosDuration|
+-----+--------------+
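
As an aside, the same lookup table can be built more compactly with toDF, which infers string columns from the tuples; a sketch equivalent to the createDataFrame call above:

df2 = (sc.parallelize([('0', 'UsersNb'),
                       ('1', 'VideosNb'),
                       ('2', 'VideosDuration')])
       .toDF(['index', 'MetricName']))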

Finally:

(df1
 .join(df2, df1.index == df2.index)
 .drop(df2.index)
 .select('index', 'OrderDate', 'OrderPeriod', 'MetricName', 'MetricValue')
 .show())

+-----+----------+-----------+--------------+-----------+
|index| OrderDate|OrderPeriod|    MetricName|MetricValue|
+-----+----------+-----------+--------------+-----------+
|    0|2016-10-19|   2016-293|      VideosNb|     143938|
|    1|2016-10-19|   2016-293|       UsersNb|      40020|
|    2|2016-10-19|   2016-293|VideosDuration|135891225.0|
+-----+----------+-----------+--------------+-----------+
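
One detail worth noting: df1.index comes out of zipWithIndex as a bigint, while df2.index here is a string, so the equality join relies on an implicit cast. A sketch of the join condition with the types made explicit:

df1.join(df2, df1.index == df2.index.cast('bigint'))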

This is my expected output, but this approach takes quite a long time. I'd like to do it in one or two lines, for example with something like the lit method:

from pyspark.sql.functions import lit

df1.withColumn('MetricName', lit('my_string'))

But of course I need to inject 3 different strings here: 'VideosNb', 'UsersNb' and 'VideosDuration'.
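
The closest I can think of is chaining conditions on the index column, something like this untested sketch (assuming the index-to-name mapping from df2 above), but it still feels heavier than a single lit-style call:

from pyspark.sql.functions import when

# Untested sketch: assumes indexes 0/1/2 map to UsersNb/VideosNb/VideosDuration.
df1.withColumn('MetricName',
               when(df1.index == 0, 'UsersNb')
               .when(df1.index == 1, 'VideosNb')
               .otherwise('VideosDuration'))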

Any ideas? Thank you very much!

0 Answers:

No answers yet.