How to unpack a pyspark WrappedArray

Time: 2018-08-29 13:42:00

Tags: arrays struct pyspark

I have a pyspark query that returns a WrappedArray:

det_port_arr = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival'])
det_port_arr.show(2, truncate=False)
det_port_arr.dtypes

The output is a DataFrame with a single column, but that column is a struct containing an array of structs:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival                                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]),device,Moored]|
|null                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

[('DetectedPortArrival',
  'struct<poiMeta:array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>,sourceType:string,statusType:string>')]
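As an aside, the standard printSchema method renders the same schema as a tree, which can be easier to read than the dtypes string:

det_port_arr.printSchema()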

If I try to select the poiMeta member of the struct:

temp = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival']['poiMeta'])
temp.show(truncate=False)
print type(temp)

I get:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival.poiMeta                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]]                                  |
|null                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Here are the dtypes:

temp.dtypes

[('DetectedPortArrival.poiMeta',
  'array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>')]

But here is the problem: I can't seem to query the DetectedPortArrival.poiMeta column:

df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
df2.show(2)

AnalysisException                         Traceback (most recent call last)
<ipython-input-46-c7f0041cffe9> in <module>()
----> 1 df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
      2 df2.show(3)

/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
    996         if len(expr) == 1 and isinstance(expr[0], list):
    997             expr = expr[0]
--> 998         jdf = self._jdf.selectExpr(self._jseq(expr))
    999         return DataFrame(jdf, self.sql_ctx)
   1000 

/opt/spark/spark-2.1.0-bin-hadoop2.4/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u"cannot resolve '`DetectedPortArrival.poiMeta`' given input columns: [DetectedPortArrival.poiMeta]; line 1 pos 0;\n'Project ['DetectedPortArrival.poiMeta]\n+- Project [DetectedPortArrival#268.poiMeta AS DetectedPortArrival.poiMeta#503]\n   +- Project [asOf#263, vesselId#264, DetectedPortArrival#268, DetectedPortDeparture#269]\n      +- Sort [asOf#263 ASC NULLS FIRST], true\n         +- Project [smfPayloadData#1.paired.shipmentId AS shipmentId#262, smfPayloadData#1.timestamp.asOf AS asOf#263, smfPayloadData#1.paired.vesselId AS vesselId#264, smfPayloadData#1.paired.vesselName AS vesselName#265, smfPayloadData#1.geolocation.speed AS speed#266, smfPayloadData#1.geolocation.detectedPois AS detectedPois#267, smfPayloadData#1.events.DetectedPortArrival AS DetectedPortArrival#268, smfPayloadData#1.events.DetectedPortDeparture AS DetectedPortDeparture#269]\n            +- Filter ((((cast(smfPayloadData#1.paired.vesselId as double) = cast(9776183 as double)) && isnotnull(smfPayloadData#1.paired.shipmentId)) && (length(smfPayloadData#1.paired.shipmentId) > 0)) && (isnotnull(smfPayloadData#1.paired.vesselId) && (isnotnull(smfPayloadData#1.events.DetectedPortArrival) || isnotnull(smfPayloadData#1.events.DetectedPortDeparture))))\n               +- SubqueryAlias smurf_processed\n                  +- Relation[smfMetaData#0,smfPayloadData#1,smfTransientData#2] parquet\n"

Any suggestions on how to query this column?

1 answer:

Answer 0: (score: 0)

Could you just select the column by index?

temp.select(temp.columns[0]).show()
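If select trips over the dot the same way selectExpr did (both parse a dotted name as struct-field access), escaping the name in backticks tells Spark to treat it as a single identifier. A sketch using the column names from the question:

df2 = temp.selectExpr("`DetectedPortArrival.poiMeta`")
df2.show(2, truncate=False)

And since the title asks how to unpack the WrappedArray itself, explode should turn the array of structs into one row per struct. A sketch, assuming the original vessel_arrival_depart_df from the question; note that explode drops the rows where the array is null:

from pyspark.sql.functions import explode

# Alias the exploded struct so its fields can be addressed as 'poi.<field>'
# without creating another dotted top-level column name.
poi_df = vessel_arrival_depart_df.select(
    explode(vessel_arrival_depart_df['DetectedPortArrival']['poiMeta']).alias('poi'))
poi_df.select('poi.poiCategory', 'poi.poiId', 'poi.poiName', 'poi.poiType').show(truncate=False)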

Best regards