Concatenate columns and select some columns in a PySpark dataframe

Time: 2018-06-22 06:03:19

Tags: apache-spark pyspark

I have a dataframe in pyspark like below:

data = [
    (1, 'a', '', 'b', '', 'c', '123_abc', 'sam', 'NY'),
    (2, 'b', 'abc_123', 'd', '', 'e', '', 'Tim', 'NJ'),
    (3, 'c', '', 'f', '', 'g', '', 'Jim', 'SFO')
]

df = sc.parallelize(data).toDF(["id", "abc_abled", "abc_serial",
                                "bca_abled", "bca_serial",
                                "cca_abled", "cca_serial",
                                "name", "city"])

df
DataFrame[id: int, abc_abled: string, abc_serial: string, bca_abled: string, bca_serial: string, cca_abled: string, cca_serial: string, name: string, city: string]

df.show()
+---+---------+----------+---------+----------+---------+----------+----+----+
| id|abc_abled|abc_serial|bca_abled|bca_serial|cca_abled|cca_serial|name|city|
+---+---------+----------+---------+----------+---------+----------+----+----+
|  1|        a|      null|        b|      null|        c|   123_abc| sam|  NY|
|  2|        b|   abc_123|        d|      null|        e|      null| Tim|  NJ|
|  3|        c|      null|        f|      null|        g|      null| Jim| SFO|
+---+---------+----------+---------+----------+---------+----------+----+----+

I want to create a new dataframe from it by selecting some columns and concatenating certain column values.

Here all columns that end with _serial will be concatenated into serial_number:

df1
DataFrame[id: int, serial_number: string, name: string, city: string]

df1.show()
+---+-------------+----+----+
| id|serial_number|name|city|
+---+-------------+----+----+
|  1|      123_abc| sam|  NY|
|  2|      abc_123| Tim|  NJ|
|  3|             | Jim| SFO|
+---+-------------+----+----+

How can I achieve this?

1 Answer:

Answer 0 (score: 2)

All you have to do is get an array of the column names that end with _serial.

serialCols = [x for x in df.columns if str(x).endswith('_serial')]
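
With the sample dataframe above, serialCols evaluates to ['abc_serial', 'bca_serial', 'cca_serial'].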

Then use it with the concat_ws built-in function to concatenate those column values in the select expression:

from pyspark.sql import functions as f
df.select(
    df['id'],
    f.concat_ws('', *serialCols).alias('serial_number'),
    df['name'],
    df['city']
).show(truncate=False)

Here I have used an empty string as the separator to concatenate the strings.
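
If you want a visible separator between the serial values instead, pass it as the first argument to concat_ws; the hyphen below is just an illustrative choice:

from pyspark.sql import functions as f

df.select(
    df['id'],
    # '-' is placed between each non-null value that gets joined
    f.concat_ws('-', *serialCols).alias('serial_number'),
    df['name'],
    df['city']
).show(truncate=False)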

So the above code should give you:

+---+-------------+----+----+
|id |serial_number|name|city|
+---+-------------+----+----+
|1  |123_abc      |sam |NY  |
|2  |abc_123      |Tim |NJ  |
|3  |             |Jim |SFO |
+---+-------------+----+----+

Edit: pyspark.sql.functions.concat() can also be used instead of concat_ws().
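
A minimal sketch of that alternative, with one caveat: unlike concat_ws(), concat() returns null as soon as any input column is null, so the columns are wrapped in coalesce() here to substitute empty strings first:

from pyspark.sql import functions as f

df.select(
    df['id'],
    # coalesce() replaces null serial values with '' so concat()
    # doesn't turn the whole result into null
    f.concat(*[f.coalesce(f.col(c), f.lit('')) for c in serialCols]).alias('serial_number'),
    df['name'],
    df['city']
).show(truncate=False)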