Question

I am using Python 3.6 and PySpark 2.3.0. In the example below there are only two entries in item, but there could be more, e.g. first_name, last_name, city.

I have a DataFrame with the following schema:

|-- email: string (nullable = true)
|-- item: struct (nullable = true)
|    |-- item: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- data: string (nullable = true)
|    |    |    |-- fieldid: string (nullable = true)
|    |    |    |-- fieldname: string (nullable = true)
|    |    |    |-- fieldtype: string (nullable = true)

This is my current output:

+-----+-----------------------------------------------------------------------------------------+
|email|item                                                                                     |
+-----+-----------------------------------------------------------------------------------------+
|x    |[[[Gmail, 32, Email Client, dropdown], [Device uses Proxy Server, 33, Device, dropdown]]]|
|y    |[[[IE, 32, Email Client, dropdown], [Personal computer, 33, Device, dropdown]]]          |
+-----+-----------------------------------------------------------------------------------------+

I want to transform this DataFrame into:

+-----+------------+------------------------+
|email|Email Client|Device                  |
+-----+------------+------------------------+
|x    |Gmail       |Device uses Proxy Server|
|y    |IE          |Personal computer       |
+-----+------------+------------------------+

I have applied some transformations:

df = df.withColumn('item', df.item.item)
df = df.withColumn('column_names', df.item.fieldname)
df = df.withColumn('column_values', df.item.data)

Now my output is:

+-----+----------------------+---------------------------------+
|email|column_names          |column_values                    |
+-----+----------------------+---------------------------------+
|x    |[Email Client, Device]|[Gmail, Device uses Proxy Server]|
|y    |[Email Client, Device]|[IE, Personal computer]          |
+-----+----------------------+---------------------------------+

From here, I would like a way to zip these columns.

Answer 1

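You asked how to zip the arrays, but you can actually get the desired output without the intermediate step of creating the column_names and column_values columns.

Use the getItem() function to grab the desired values by index: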

import pyspark.sql.functions as f

# assumes 'item' was already unwrapped with df.withColumn('item', df.item.item)
df = df.select(
    'email',
    f.col('item.data').getItem(0).alias('Email Client'),
    f.col('item.data').getItem(1).alias('Device')
)
df.show(truncate=False)
#+-----+------------+------------------------+
#|email|Email Client|Device                  |
#+-----+------------+------------------------+
#|x    |Gmail       |Device uses Proxy Server|
#|y    |IE          |Personal computer       |
#+-----+------------+------------------------+

This assumes that the Email Client field is always at index 0 and Device is at index 1.

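If you can't assume that the fields are always in the same order in each row, another option is to create a map from the values in column_names and column_values using pyspark.sql.functions.create_map().

This function takes:

a list of column names (strings) or a list of Column expressions that are grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...).

We iterate over the items in column_names and column_values to create a list of the pairs, and then use list(chain.from_iterable(...)) to flatten the list.

Once the map column is created, you can select the fields by key.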

from itertools import chain

# first create a map type column called 'map'
df = df.select(
    'email',
    f.create_map(
        list(
            chain.from_iterable(
                [[f.col('column_names').getItem(i), f.col('column_values').getItem(i)] 
                 for i in range(2)]
            )
        )
    ).alias('map')
)
df.show(truncate=False)
#+-----+--------------------------------------------------------------+
#|email|map                                                           |
#+-----+--------------------------------------------------------------+
#|x    |Map(Email Client -> Gmail, Device -> Device uses Proxy Server)|
#|y    |Map(Email Client -> IE, Device -> Personal computer)          |
#+-----+--------------------------------------------------------------+

# now select the fields by key
df = df.select(
    'email',
    f.col('map').getField("Email Client").alias("Email Client"),
    f.col('map').getField("Device").alias("Device")
)

This assumes that there are always at least 2 elements in each array.
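
The flattening step can be checked in plain Python, without a Spark session. The sketch below uses string stand-ins (hypothetical placeholders, not real Column objects) for the expressions built by the list comprehension, to show the interleaved key, value order that create_map() expects.

```python
from itertools import chain

# Stand-ins for f.col('column_names').getItem(i) and
# f.col('column_values').getItem(i); plain strings are used here
# only for illustration.
pairs = [
    ["column_names[{}]".format(i), "column_values[{}]".format(i)]
    for i in range(2)
]

# chain.from_iterable flattens the list of [key, value] pairs into
# a single list: key1, value1, key2, value2, ...
flat = list(chain.from_iterable(pairs))
print(flat)
# ['column_names[0]', 'column_values[0]', 'column_names[1]', 'column_values[1]']
```

Changing range(2) to a larger number would add more key-value pairs, provided every array really has that many elements.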

If you want to zip lists of arbitrary length, you will have to use a udf.

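# import the types used for the udf return types
from pyspark.sql.types import ArrayType, MapType, StringType

# define the udf; each zipped pair is itself a list, so the return type is an array of arrays
zip_lists = f.udf(lambda x, y: [list(z) for z in zip(x, y)], ArrayType(ArrayType(StringType())))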

# use the udf to zip the lists
df.select(
    'email',
    zip_lists(f.col('column_names'), f.col('column_values')).alias('zipped')
).show(truncate=False)
#+-----+-----------------------------------------------------------+
#|email|zipped                                                     |
#+-----+-----------------------------------------------------------+
#|x    |[[Email Client, Gmail], [Device, Device uses Proxy Server]]|
#|y    |[[Email Client, IE], [Device, Personal computer]]          |
#+-----+-----------------------------------------------------------+
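
One thing to watch for: a Python udf receives None when an input array is null, and zip(None, ...) raises a TypeError. The following is a minimal sketch of a null-safe helper (the name zip_lists_safe is hypothetical) that could be passed to f.udf in place of the lambda. It is shown as plain Python so its behavior can be checked without Spark.

```python
def zip_lists_safe(names, values):
    """Pair up two lists element by element; return None if either is None."""
    if names is None or values is None:
        return None
    # zip stops at the shorter input, so extra elements are dropped.
    return [list(pair) for pair in zip(names, values)]


print(zip_lists_safe(["Email Client", "Device"], ["Gmail", "Device uses Proxy Server"]))
# [['Email Client', 'Gmail'], ['Device', 'Device uses Proxy Server']]
print(zip_lists_safe(None, ["Gmail"]))
# None
```

Whether dropping the unmatched tail is acceptable depends on your data; if the two arrays can differ in length, you may prefer to raise an error instead.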

Or you can use a udf to create the map:

make_map = f.udf(lambda x, y: dict(zip(x, y)), MapType(StringType(), StringType()))
df.select(
    'email',
    make_map(f.col('column_names'), f.col('column_values')).alias('map')
).show(truncate=False)
#+-----+--------------------------------------------------------------+
#|email|map                                                           |
#+-----+--------------------------------------------------------------+
#|x    |Map(Device -> Device uses Proxy Server, Email Client -> Gmail)|
#|y    |Map(Device -> Personal computer, Email Client -> IE)          |
#+-----+--------------------------------------------------------------+
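
Note that dict(zip(x, y)) keeps only the last value when a key repeats and, like zip, stops at the shorter list. Here is a quick plain-Python check of the logic wrapped by make_map above (make_map_py is a hypothetical name used only for this sketch):

```python
def make_map_py(names, values):
    # Same logic as the lambda passed to f.udf: pair names with values.
    return dict(zip(names, values))


result = make_map_py(["Email Client", "Device"], ["IE", "Personal computer"])
print(result["Email Client"])  # IE
print(result["Device"])        # Personal computer

# A repeated key keeps only the last value seen.
dup = make_map_py(["Device", "Device"], ["Laptop", "Phone"])
print(dup)  # {'Device': 'Phone'}
```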

How to zip two columns in PySpark?
