Question

I have this schema:

root
 |-- _id: long (nullable = true)
 |-- _published-at: string (nullable = true)
 |-- _title: string (nullable = true)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |-- p: array (nullable = true)
 |    |-- element: string (containsNull = true)

from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

link_structure = StructType([
    StructField("_VALUE", StringType(), True),
    StructField("_href", StringType(), True),
    StructField("_type", StringType(), True)
])

articles_schema = StructType([
    StructField("_id", LongType(), True),
    StructField("_published-at", StringType(), True),
    StructField("_title", StringType(), True),
    StructField("a", ArrayType(link_structure), True),
    StructField("p", ArrayType(StringType()), True)])

Sample data:

+---+-------------+--------------------+--------------------+--------------------+
|_id|_published-at|              _title|                   a|                   p|
+---+-------------+--------------------+--------------------+--------------------+
| 17|   2004-07-29|SAN FRANCISCO / H...|[[Gwendolyn Tucke...|[Chief juvenile p...|
| 19|   2017-10-05|Nancy Pelosi Lies...|[[so he asked her...|[CNN recently hos...|
| 23|   2017-04-20|University leader...|[[letter, http://...|[Pro-life student...|
| 24|   2011-01-14|What Wine Prices ...|[[A new working p...|[More on:, <a>Fos...|
+---+-------------+--------------------+--------------------+--------------------+

And this is the code that creates a new column holding each article's links as an array:

import pyspark.sql.functions as fun  # alias assumed from the calls below

articles_df = articles_df.\
    withColumn("content", fun.col('p')).\
    withColumn("links", fun.array(fun.col('a._href')))

Resulting data:

+---+--------------------+--------------------+
|_id|             content|               links|
+---+--------------------+--------------------+
| 17|[Chief juvenile p...|[[/search/?action...|
| 19|[CNN recently hos...|[[https://www.you...|
| 23|[Pro-life student...|[[http://yourstud...|
| 24|[More on:, <a>Fos...|[[http://www.imf....|
+---+--------------------+--------------------+

I need to save each article's links in separate rows, one row per link, so I tried the explode function from pyspark.sql on the links column:

df2 = articles_df.select("_id", fun.explode(fun.col('links')))

But I get this result:

+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+

How can I do this?

PySpark explode SQL function

0 answers: