I have this schema:
root
|-- _id: long (nullable = true)
|-- _published-at: string (nullable = true)
|-- _title: string (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _href: string (nullable = true)
| | |-- _type: string (nullable = true)
|-- p: array (nullable = true)
| |-- element: string (containsNull = true)
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Each element of column "a" is a link: its text, target URL and type.
link_structure = StructType([
    StructField("_VALUE", StringType(), True),
    StructField("_href", StringType(), True),
    StructField("_type", StringType(), True)
])

articles_schema = StructType([
    StructField("_id", LongType(), True),
    StructField("_published-at", StringType(), True),
    StructField("_title", StringType(), True),
    StructField("a", ArrayType(link_structure), True),
    StructField("p", ArrayType(StringType()), True)
])
Sample data:
+---+-------------+--------------------+--------------------+--------------------+
|_id|_published-at| _title| a| p|
+---+-------------+--------------------+--------------------+--------------------+
| 17| 2004-07-29|SAN FRANCISCO / H...|[[Gwendolyn Tucke...|[Chief juvenile p...|
| 19| 2017-10-05|Nancy Pelosi Lies...|[[so he asked her...|[CNN recently hos...|
| 23| 2017-04-20|University leader...|[[letter, http://...|[Pro-life student...|
| 24| 2011-01-14|What Wine Prices ...|[[A new working p...|[More on:, <a>Fos...|
+---+-------------+--------------------+--------------------+--------------------+
And this is the code that creates a new column holding the array of links for each article:
from pyspark.sql import functions as fun  # assumed: `fun` is an alias for pyspark.sql.functions
articles_df = articles_df.\
    withColumn("content", fun.col('p')).\
    withColumn("links", fun.array(fun.col('a._href')))
Data:
+---+--------------------+--------------------+
|_id| content| links|
+---+--------------------+--------------------+
| 17|[Chief juvenile p...|[[/search/?action...|
| 19|[CNN recently hos...|[[https://www.you...|
| 23|[Pro-life student...|[[http://yourstud...|
| 24|[More on:, <a>Fos...|[[http://www.imf....|
+---+--------------------+--------------------+
I need to save each article's links, with each link in its own row, so I tried the pyspark.sql explode function on the links column:
df2 = articles_df.select("_id", fun.explode(fun.col('links')))
But I get this result:
+---+----+
|_id| col|
+---+----+
| 17|null|
| 19|null|
| 23|null|
| 24|null|
+---+----+
How can I do this?