Question

I am new to PySpark.

I have a JSON file with the following schema:

df = spark.read.json(input_file)

df.printSchema()

 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- type: long (nullable = true)

I want a new result DataFrame, resultDF, that has only two columns: type and UrlsInfo.element.DisplayUrl.

This is the code I tried, which does not give the expected output:

  df.createOrReplaceTempView("the_table")
  resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
  resultDF.show()

This is related to JSON file parsing in Pyspark, but that question does not answer mine.

Answer 1

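As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item therefore does not refer to a named property (which you are trying to access with .element) but to an array element (which responds to an index such as [0]).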

I reproduced your schema by hand:

from pyspark.sql import Row
# `spark` is the SparkSession (predefined in the pyspark shell)
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()

root
 |-- Type: long (nullable = true)
 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUri: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)

By indexing into the array (for example with [0]), I was able to produce a table like the one you are looking for.

Edit: I forgot about the EXPLODE function, which you can use here to treat the UrlsInfo elements as a set of rows.
