Question

Using pyspark, I am reading multiple files, each containing one JSON object, from the folder contentdata2:

df = spark.read\
.option("mode", "DROPMALFORMED")\
.json("./data/contentdata2/")

df.printSchema()
content = df.select('fields').collect()

where df.printSchema() yields

root
|-- fields: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- field: string (nullable = true)
|    |    |-- type: string (nullable = true)
|    |    |-- value: string (nullable = true)
|-- id: string (nullable = true)
|-- score: double (nullable = true)
|-- siteId: string (nullable = true)

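I want to access fields.element.field and, for each JSON object, store the value of the field equal to body and the value of the field equal to urlhash.

content is a list of Rows (one per JSON object), each with a fields attribute that holds further Rows, like this: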

[Row(fields=[Row(field='body', type=None, value='["First line of text","Second line of text"]'), Row(field='urlhash', type=None, value='0a0b774c21c68325aa02cae517821e78687b2780')]), Row(fields=[Row(field='body', type=None, value='["First line of text","Second line of text"]'), Row(field='urlhash', type=None, value='0a0b774c21c6caca977e7821e78687b2780')]), ...

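The pattern "[Row(fields=[Row(field=..." repeats because the JSON objects from the different files are merged together into one list. There are also many other Row elements that I am not interested in, so I have left them out of this example.

The structure of the JSON objects looks like this: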

{
  "fields": [
    {
      "field": "body",
      "value": [
        "Some text",
        "Another line of text",
        "Third line of text."
      ]
    },
    {
      "field": "urlhash",
      "value": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
    }
  ],
  "score": 0.87475455,
  "siteId": "9222270286501375973",
  "id": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
}

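I want to store all the words from the body of each URL, so that I can later remove stop words and feed them into a K-nearest-neighbors algorithm.

How can I store the words from the body for each URL, preferably as a tsv or csv with the columns urlhash and words (a list of the words from the body)?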

Answer 1

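You can approach this in two ways:

You can explode the array to get one record per row, then flatten the nested DataFrame
or access the sub-fields directly (for Spark > 2.X)

Let's start from your sample DataFrame: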

from pyspark.sql import Row
from pyspark.sql.types import *
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('field', StringType()), 
        StructField('type', StringType()), 
        StructField('value', StringType())])))])

content = spark.createDataFrame(
    sc.parallelize([
        Row(
            fields=[
                Row(
                    field='body', 
                    type=None, 
                    value='["First line of text","Second line of text"]'), 
                Row(
                    field='urlhash', 
                    type=None, 
                    value='0a0b774c21c68325aa02cae517821e78687b2780')]), 
        Row(
            fields=[
                Row(
                    field='body', 
                    type=None, 
                    value='["First line of text","Second line of text"]'), 
                Row(
                    field='urlhash', 
                    type=None, 
                    value='0a0b774c21c6caca977e7821e78687b2780')])]), schema=schema)
content.printSchema()

    root
     |-- fields: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- field: string (nullable = true)
     |    |    |-- type: string (nullable = true)
     |    |    |-- value: string (nullable = true)

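1. Explode and flatten

The fields of a nested DataFrame can be accessed with ., and * flattens all the nested fields and brings them to the root level.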

import pyspark.sql.functions as psf
content \
    .select(psf.explode('fields').alias('tmp')) \
    .select('tmp.*') \
    .show()

    +-------+----+--------------------+
    |  field|type|               value|
    +-------+----+--------------------+
    |   body|null|["First line of t...|
    |urlhash|null|0a0b774c21c68325a...|
    |   body|null|["First line of t...|
    |urlhash|null|0a0b774c21c6caca9...|
    +-------+----+--------------------+

    root
     |-- field: string (nullable = true)
     |-- type: string (nullable = true)
     |-- value: string (nullable = true)
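
To make the effect of explode followed by tmp.* concrete, here is a plain-Python analogue. It is not Spark code: Field and Record are hypothetical namedtuple stand-ins for pyspark Row objects, used only so the sketch runs without a Spark session. It shows that every array element becomes its own flat record.

```python
from collections import namedtuple

# Hypothetical stand-ins for pyspark Row objects (assumption: not the Spark API).
Field = namedtuple("Field", ["field", "type", "value"])
Record = namedtuple("Record", ["fields"])

records = [
    Record(fields=[
        Field("body", None, '["First line of text","Second line of text"]'),
        Field("urlhash", None, "0a0b774c21c68325aa02cae517821e78687b2780"),
    ]),
    Record(fields=[
        Field("body", None, '["First line of text","Second line of text"]'),
        Field("urlhash", None, "0a0b774c21c6caca977e7821e78687b2780"),
    ]),
]

# "explode": one output item per array element.
# "tmp.*": expose the struct members as top-level columns (a plain tuple here).
flattened = [tuple(f) for rec in records for f in rec.fields]

for row in flattened:
    print(row)
```

Note that the flattened records no longer say which input record they came from, so with this sample DataFrame a body row and its urlhash row are related only by their position.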

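2. Access the sub-fields directly

In later versions of Spark, you can access the fields of a nested StructType even when they are contained in an ArrayType. You end up with an ArrayType holding the values of the sub-field.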

content \
    .select('fields.field') \
    .show()

    +---------------+
    |          field|
    +---------------+
    |[body, urlhash]|
    |[body, urlhash]|
    +---------------+

    root
     |-- field: array (nullable = true)
     |    |-- element: string (containsNull = true)
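
Neither approach by itself yields the urlhash/words table asked for in the question. One possible way to get there is sketched below in plain Python, working on rows that have already been collected to the driver (as the question does with collect()). The names Field, Record, to_urlhash_words and write_tsv are hypothetical, the namedtuples merely stand in for pyspark Row objects, and splitting on whitespace is an assumed definition of "word". The sketch relies on the body value being a JSON-encoded list of strings, as in the sample data.

```python
import csv
import io
import json
from collections import namedtuple

# Hypothetical stand-ins for collected pyspark Row objects (assumption).
Field = namedtuple("Field", ["field", "type", "value"])
Record = namedtuple("Record", ["fields"])


def to_urlhash_words(record):
    """Return (urlhash, words) for one record, or None if a value is missing."""
    by_name = {f.field: f.value for f in record.fields}
    if by_name.get("body") is None or by_name.get("urlhash") is None:
        return None
    lines = json.loads(by_name["body"])  # body holds a JSON-encoded list of strings
    words = [word for line in lines for word in line.split()]
    return by_name["urlhash"], words


def write_tsv(records, out):
    """Write one tab-separated line per record: urlhash, then space-joined words."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["urlhash", "words"])
    for record in records:
        pair = to_urlhash_words(record)
        if pair is not None:
            urlhash, words = pair
            writer.writerow([urlhash, " ".join(words)])


collected = [
    Record(fields=[
        Field("body", None, '["First line of text","Second line of text"]'),
        Field("urlhash", None, "0a0b774c21c68325aa02cae517821e78687b2780"),
    ]),
    Record(fields=[
        Field("urlhash", None, "0a0b774c21c6caca977e7821e78687b2780"),
    ]),
]

buffer = io.StringIO()
write_tsv(collected, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Collecting everything to the driver only suits data that fits in memory; for larger inputs the same pairing would have to be expressed with DataFrame operations instead.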

Accessing rows within a Row (nested JSON) of a DataFrame using Pyspark
