Firstly, I am using pyspark==2.4.5.
When reading from the JSON file (it can be found here: https://filebin.net/53rhhigep2zpqdga) I need to explode the data, and after exploding I no longer need the data column, nor do I need statistics. This is how I read the file and explode it:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master('local[2]').appName("createDataframe")\
    .getOrCreate()
json_data = spark.read.option('multiline', True).json(file_name)
json_data = json_data.withColumn("data_values", F.explode_outer("data"))\
    .drop("data", "statistics")
Below you will see the schema and the first 5 rows of json_data:
root
|-- data_values: struct (nullable = true)
| |-- date: string (nullable = true)
| |-- events: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- active: long (nullable = true)
| | | |-- index: long (nullable = true)
| | | |-- mode: long (nullable = true)
| | | |-- rate: long (nullable = true)
| | | |-- timestamp: string (nullable = true)
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|data_values |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]] |
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|
|[2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]]|
|[2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]] |
|[2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
Above you can see that I create a column called data_values, which explodes my incoming JSON data. Now, to get the data I need, I run the queries below: I create columns that extract the events and the date from data_values, and afterwards drop data_values.
newData = json_data\
.withColumn("events", F.explode(json_data.data_values.events))\
.withColumn("date", json_data.data_values.date)
newData.printSchema()
newData.show(3)
finalData = newData.drop("data_values")
finalData.show(6)
Below you will see how the schema of newData looks and its first 5 rows:
root
|-- data_values: struct (nullable = true)
| |-- date: string (nullable = true)
| |-- events: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- active: long (nullable = true)
| | | |-- index: long (nullable = true)
| | | |-- mode: long (nullable = true)
| | | |-- rate: long (nullable = true)
| | | |-- timestamp: string (nullable = true)
|-- events: struct (nullable = true)
| |-- active: long (nullable = true)
| |-- index: long (nullable = true)
| |-- mode: long (nullable = true)
| |-- rate: long (nullable = true)
| |-- timestamp: string (nullable = true)
|-- date: string (nullable = true)
+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|data_values |events |date |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]] |[0, 0, 1, 0, 2019-02-20T00:00:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]] |[0, 1, 1, 0, 2019-02-20T00:01:00]|2019-02-20|
|[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]] |[0, 2, 1, 0, 2019-02-20T00:02:00]|2019-02-20|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|[1, 0, 1, 0, 2019-02-21T00:03:00]|2019-02-21|
|[2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]]|[0, 1, 1, 0, 2019-02-21T00:04:00]|2019-02-21|
+----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------+----------+
The schema of newData has the fields I want, but when I then drop data_values and call .show() on the result, I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _gen_alias_25#25
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:275)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:63)
at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:193)
at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:148)
at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:495)
at org.apache.spark.sql.execution.InputRDDCodegen.doProduce(WholeStageCodegenExec.scala:482)
at org.apache.spark.sql.execution.InputRDDCodegen.doProduce$(WholeStageCodegenExec.scala:455)
at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:495)
at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:94)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:89)
at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:89)
at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:495)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:49)
at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:94)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:89)
at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:89)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:39)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:629)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:689)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:173)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:161)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3482)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2581)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3472)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3468)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2581)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2788)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:297)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:334)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_25#25 in [data_values#5]
at scala.sys.package$.error(package.scala:30)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
I have tried creating a new dataframe after dropping data_values and only running .show() afterwards, but I still run into the same issue. The schema of that final dataframe has exactly the fields I want:
root
|-- events: struct (nullable = true)
| |-- active: long (nullable = true)
| |-- rate: long (nullable = true)
| |-- index: long (nullable = true)
| |-- mode: long (nullable = true)
| |-- timestamp: string (nullable = true)
|-- date: string (nullable = true)
I am guessing that, for some reason, the new columns I created still reference data_values, so maybe the way I create these columns is wrong? I have tried to find others who have run into the same problem online, but it does not seem to be a common issue, as there is hardly any information about this error.
Answer 0 (score: 1)
I am using Spark 2.4.3 here.
Point 1: Update the path and read the JSON file
>>> from pyspark.sql import functions as F
>>> json_data = spark.read.option('multiline', True).json("/home/maheshpersonal/stack.json")
>>> json_data.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[2019-02-20, [[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]]], [2019-02-21, [[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]], [2019-02-22, [[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]], [2019-02-23, [[1, 3, 1, 1, 2019-02-23T00:16:00]]], [2019-02-24, [[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]]]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
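A note on the multiline option used above: it tells the reader that a single JSON document spans several lines; without it, Spark expects one JSON object per line (JSON Lines). Writing the option value as a string should behave identically, for example:

json_data = spark.read.option("multiline", "true").json("/home/maheshpersonal/stack.json")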
Point 2: Check the schema
>>> json_data.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: string (nullable = true)
| | |-- events: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- active: long (nullable = true)
| | | | |-- index: long (nullable = true)
| | | | |-- mode: long (nullable = true)
| | | | |-- rate: long (nullable = true)
| | | | |-- timestamp: string (nullable = true)
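As a side note, if the structure of the file is known up front, the schema can also be passed to the reader explicitly instead of being inferred, which avoids an extra pass over the data. A minimal sketch that mirrors the schema printed above (event_type and file_schema are only illustrative names, and the path is the same one used in this answer):

from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Schema of a single event entry.
event_type = StructType([
    StructField("active", LongType()),
    StructField("index", LongType()),
    StructField("mode", LongType()),
    StructField("rate", LongType()),
    StructField("timestamp", StringType()),
])

# Top level: "data" is an array of (date, events) structs.
file_schema = StructType([
    StructField("data", ArrayType(StructType([
        StructField("date", StringType()),
        StructField("events", ArrayType(event_type)),
    ]))),
])

json_data = spark.read.option('multiline', True).schema(file_schema).json("/home/maheshpersonal/stack.json")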
Point 3: Explode the data column
>>> json_data_1 = json_data.withColumn("data_values", F.explode_outer("data"))
>>> json_data_1.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: string (nullable = true)
| | |-- events: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- active: long (nullable = true)
| | | | |-- index: long (nullable = true)
| | | | |-- mode: long (nullable = true)
| | | | |-- rate: long (nullable = true)
| | | | |-- timestamp: string (nullable = true)
|-- data_values: struct (nullable = true)
| |-- date: string (nullable = true)
| |-- events: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- active: long (nullable = true)
| | | |-- index: long (nullable = true)
| | | |-- mode: long (nullable = true)
| | | |-- rate: long (nullable = true)
| | | |-- timestamp: string (nullable = true)
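A quick note on explode_outer versus explode: explode_outer keeps a row (with a null data_values) even when the array is null or empty, whereas explode drops such rows. A tiny sketch with a made-up toy dataframe to show the difference:

from pyspark.sql import functions as F

# Hypothetical toy dataframe: one row with an array, one with a null array.
toy = spark.createDataFrame([(1, [10, 20]), (2, None)], ["id", "arr"])

toy.select("id", F.explode("arr").alias("v")).show()        # the id=2 row disappears
toy.select("id", F.explode_outer("arr").alias("v")).show()  # the id=2 row is kept, v is null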
Point 4: Select the columns as required
>>> newData = json_data_1.withColumn("events", json_data_1.data_values.events).withColumn("date", json_data_1.data_values.date)
>>> newData.show()
+--------------------+--------------------+--------------------+----------+
| data| data_values| events| date|
+--------------------+--------------------+--------------------+----------+
|[[2019-02-20, [[0...|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
|[[2019-02-20, [[0...|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
|[[2019-02-20, [[0...|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
|[[2019-02-20, [[0...|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
|[[2019-02-20, [[0...|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
+--------------------+--------------------+--------------------+----------+
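For reference, json_data_1.data_values.events is attribute-style access to a nested struct field; a string-based F.col reference selects the same field and is not tied to a particular dataframe variable. A small sketch of the two equivalent forms (events_a and events_b are just illustrative names):

from pyspark.sql import functions as F

# Attribute-style access on the dataframe, as used above ...
events_a = json_data_1.withColumn("events", json_data_1.data_values.events)

# ... and the equivalent string-based reference to the same nested field.
events_b = json_data_1.withColumn("events", F.col("data_values.events"))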
Point 5: Drop the data column from the dataframe
>>> newData_v1 = newData.drop(newData.data)
>>> newData_v1.show()
+--------------------+--------------------+----------+
| data_values| events| date|
+--------------------+--------------------+----------+
|[2019-02-20, [[0,...|[[0, 0, 1, 0, 201...|2019-02-20|
|[2019-02-21, [[1,...|[[1, 0, 1, 0, 201...|2019-02-21|
|[2019-02-22, [[1,...|[[1, 0, 1, 0, 201...|2019-02-22|
|[2019-02-23, [[1,...|[[1, 3, 1, 1, 201...|2019-02-23|
|[2019-02-24, [[1,...|[[1, 0, 1, 1, 201...|2019-02-24|
+--------------------+--------------------+----------+
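For what it is worth, drop also accepts the column name as a string, so the following should be equivalent to the call above:

newData_v1 = newData.drop("data")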
Point 6: Drop the data_values column from newData_v1
>>> finalDataframe = newData_v1.drop(newData_v1.data_values)
>>> finalDataframe.show(truncate = False)
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|events |date |
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
|[[0, 0, 1, 0, 2019-02-20T00:00:00], [0, 1, 1, 0, 2019-02-20T00:01:00], [0, 2, 1, 0, 2019-02-20T00:02:00]] |2019-02-20|
|[[1, 0, 1, 0, 2019-02-21T00:03:00], [0, 1, 1, 0, 2019-02-21T00:04:00], [1, 2, 1, 1, 2019-02-21T00:05:00], [1, 3, 1, 1, 2019-02-21T00:06:00]]|2019-02-21|
|[[1, 0, 1, 0, 2019-02-22T00:03:00], [0, 1, 1, 0, 2019-02-22T00:04:00], [1, 2, 1, 1, 2019-02-22T00:05:00], [1, 3, 1, 1, 2019-02-22T00:06:00]]|2019-02-22|
|[[1, 3, 1, 1, 2019-02-23T00:16:00]] |2019-02-23|
|[[1, 0, 1, 1, 2019-02-24T00:03:00], [1, 1, 1, 0, 2019-02-24T00:04:00]] |2019-02-24|
+--------------------------------------------------------------------------------------------------------------------------------------------+----------+
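If, as in the question, every event should also end up on its own row, one more explode on the fresh dataframe works the same way. A sketch continuing from finalDataframe above (flatEvents is only an illustrative name):

from pyspark.sql import functions as F

# One row per event; selecting the struct fields yields columns named
# active, index, mode, rate and timestamp.
flatEvents = finalDataframe \
    .withColumn("event", F.explode("events")) \
    .select("date", F.col("event.active"), F.col("event.index"),
            F.col("event.mode"), F.col("event.rate"), F.col("event.timestamp"))
flatEvents.show(5, truncate=False)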
The lesson from this: always use a new dataframe to store each transformation. Please check whether it helps you :)