Unable to drop the column used to explode an array

Time: 2020-05-07 07:56:19

Tags: pyspark apache-spark-sql pyspark-dataframes

First off, I am using pyspark==2.4.5.

I am reading from a json file, which can be found here: https://filebin.net/53rhhigep2zpqdga. I need to explode data, and after exploding I no longer need data, nor do I need statistics.

Below is the code I use to read the file, explode the data column, and drop the columns I no longer need:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[2]').appName("createDataframe")\
        .getOrCreate()
# file_name is the path to the json file downloaded from the link above
json_data = spark.read.option('multiline', True).json(file_name)
json_data = json_data.withColumn("data_values", F.explode_outer("data"))\
        .drop("data", "statistics")

Above you can see that I create a column called data_values, which holds the exploded incoming json data. Below are the schema and the first 5 rows of json_data:

root
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)

+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|data_values                                                                                                                                               |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                   |
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|
|[2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]]|
|[2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]]                                                                                                         |
|[2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]]                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+

Now, to get the data I need, I run the query below: I create columns that extract the events and the date from data_values, and then I drop data_values:

newData = json_data\
        .withColumn("events", F.explode(json_data.data_values.events))\
        .withColumn("date", json_data.data_values.date)
newData.printSchema()
newData.show(3)

finalData = newData.drop("data_values")
finalData.show(6)

Below you see what the resulting schema looks like, along with the first 5 rows of newData:

root
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)
 |-- events: struct (nullable = true)
 |    |-- active: long (nullable = true)
 |    |-- index: long (nullable = true)
 |    |-- mode: long (nullable = true)
 |    |-- rate: long (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- date: string (nullable = true)

+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|data_values                                                                                                                                               |events                           |date      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                   |[0, 0, 1, 0, 2019-02-20T00:00:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                   |[0, 1, 1, 0, 2019-02-20T00:01:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]]                                   |[0, 2, 1, 0, 2019-02-20T00:02:00]|2019-02-20|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|[1, 0, 1, 0, 2019-02-21T00:03:00]|2019-02-21|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|[0, 1, 1, 0, 2019-02-21T00:04:00]|2019-02-21|
+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+

Once I have the dataframe I want, I then try to drop data_values, but calling .show() on the result gives me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_25#25
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:275)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.immutable.List.map(List.scala:298)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
    at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:63)
    at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:193)
    at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:148)
    at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:495)
    at org.apache.spark.sql.execution.InputRDDCodegen.doProduce(WholeStageCodegenExec.scala:482)
    at org.apache.spark.sql.execution.InputRDDCodegen.doProduce$(WholeStageCodegenExec.scala:455)
    at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:495)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:94)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:89)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:89)
    at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:495)
    at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:49)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:94)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:89)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:89)
    at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:39)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:629)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:689)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:173)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
    at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:161)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3482)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2581)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3472)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3468)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2581)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:297)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:334)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_25#25 in [data_values#5]
    at scala.sys.package$.error(package.scala:30)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)

The schema after dropping data_values (shown below) has exactly the fields I want, but it is the .show() call on that dataframe that produces the error above:

root
 |-- events: struct (nullable = true)
 |    |-- active: long (nullable = true)
 |    |-- rate: long (nullable = true)
 |    |-- index: long (nullable = true)
 |    |-- mode: long (nullable = true)
 |    |-- timestamp: string (nullable = true)
 |-- date: string (nullable = true)

I have also tried creating a brand-new dataframe after dropping data_values and then running .show() on that, but I still run into the same problem. My guess is that for some reason the new columns I created are still referencing data_values, so maybe the way I am creating these columns is wrong?
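One way to check that guess would be to look at the query plans and search them for references to data_values; explain() only builds the plans without executing the query, so it should still be callable even though show() fails:

# Print the parsed, analyzed, optimized and physical plans for the failing
# dataframe and look for lingering references to data_values
finalData.explain(True)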

I have tried to find other people with the same problem online, but it does not seem to be a common issue, since there is no information about the Couldn't find _gen_alias_ error.

1 Answer:

Answer 0 (score: 1)

I am using Spark 2.4.3 here.

Point 1: Update the path and read the json file

>>> from pyspark.sql import functions as F
>>> json_data = spark.read.option('multiline', True).json("/home/maheshpersonal/stack.json")
>>> json_data.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]], [2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]], [2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]], [2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]], [2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Point 2: Check the schema

>>> json_data.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- events: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- active: long (nullable = true)
 |    |    |    |    |-- index: long (nullable = true)
 |    |    |    |    |-- mode: long (nullable = true)
 |    |    |    |    |-- rate: long (nullable = true)
 |    |    |    |    |-- timestamp: string (nullable = true)

Point 3: Explode the data column

>>> json_data_1 = json_data.withColumn("data_values", F.explode_outer("data"))
>>> json_data_1.printSchema()
root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- events: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- active: long (nullable = true)
 |    |    |    |    |-- index: long (nullable = true)
 |    |    |    |    |-- mode: long (nullable = true)
 |    |    |    |    |-- rate: long (nullable = true)
 |    |    |    |    |-- timestamp: string (nullable = true)
 |-- data_values: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- events: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- active: long (nullable = true)
 |    |    |    |-- index: long (nullable = true)
 |    |    |    |-- mode: long (nullable = true)
 |    |    |    |-- rate: long (nullable = true)
 |    |    |    |-- timestamp: string (nullable = true)

Point 4: Select the columns as per the requirement

 >>> newData = json_data_1.withColumn("events", json_data_1.data_values.events).withColumn("date", json_data_1.data_values.date)

 >>> newData.show()
    +--------------------+--------------------+--------------------+----------+
    |                data|         data_values|              events|      date|
    +--------------------+--------------------+--------------------+----------+
    |[[2019-02-20, [[0...|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
    |[[2019-02-20, [[0...|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
    |[[2019-02-20, [[0...|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
    |[[2019-02-20, [[0...|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
    |[[2019-02-20, [[0...|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
    +--------------------+--------------------+--------------------+----------+

Point 5: Drop the data column from the dataframe

>>> newData_v1 = newData.drop(newData.data)
>>> newData_v1.show()
+--------------------+--------------------+----------+
|         data_values|              events|      date|
+--------------------+--------------------+----------+
|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
+--------------------+--------------------+----------+

Point 6: Drop the data_values column from newData_v1

>>> finalDataframe = newData_v1.drop(newData_v1.data_values)
>>> finalDataframe.show(truncate = False)
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|events                                                                                                                                      |date      |
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|[[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]                                   |2019-02-20|
|[[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]|2019-02-21|
|[[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]|2019-02-22|
|[[1, 3, 1, 1, 2019-02-23T00:16:00]]                                                                                                         |2019-02-23|
|[[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]                                                                      |2019-02-24|
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+

The lesson to take from this: always use a new dataframe to store each transformation. Please check whether it helps you :)
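Building on that advice, below is a minimal, untested sketch of the whole pipeline written in that style: every transformation is stored in a new dataframe, and it also explodes the nested events array into flat columns, which is what the original question ultimately needs. The file path is the one used in the answer above, the column names come from the schema shown earlier, and the extra explode of events is an addition that is not part of the answer itself.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[2]').appName("flattenEvents").getOrCreate()

# Read the multiline json file (path assumed from the answer above)
raw = spark.read.option('multiline', True).json("/home/maheshpersonal/stack.json")

# Explode the top-level data array: one row per date entry
exploded = raw.select(F.explode_outer("data").alias("data_values"))

# Pull out the date, explode the nested events array, then flatten the
# event struct; each step is a fresh select into a new dataframe, so no
# drop of a previously used column is ever needed
events = exploded.select(
    F.col("data_values.date").alias("date"),
    F.explode("data_values.events").alias("event"))
final = events.select("date", "event.*")

final.printSchema()
final.show(truncate=False)

Because nothing is dropped after being used to build another column, the Couldn't find _gen_alias_ binding error from the question should not come up with this approach.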