I have a PySpark query that returns a WrappedArray:
det_port_arr = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival'])
det_port_arr.show(2, truncate=False)
det_port_arr.dtypes
The output is a DataFrame with a single column, but that column is a struct containing an array of structs:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]),device,Moored]|
|null |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
[('DetectedPortArrival',
'struct<poiMeta:array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>,sourceType:string,statusType:string>')]
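For reference, a minimal sketch that builds a DataFrame with the same shape (the sample values here are made up; the real data comes from parquet):

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# One POI entry: poiCategory, poiId, poiName, poiType
poi = StructType([
    StructField('poiCategory', StringType()),
    StructField('poiId', LongType()),
    StructField('poiName', StringType()),
    StructField('poiType', StringType()),
])
# DetectedPortArrival: a struct holding an array of POI structs
arrival = StructType([
    StructField('poiMeta', ArrayType(poi)),
    StructField('sourceType', StringType()),
    StructField('statusType', StringType()),
])
schema = StructType([StructField('DetectedPortArrival', arrival)])
rows = [
    (([('portPoi', 5555, 'BEILUN [CNBEI]', 'marinePort')], 'device', 'Moored'),),
    (None,),
]
vessel_arrival_depart_df = spark.createDataFrame(rows, schema)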
If I try to select the poiMeta member of the struct:
temp = vessel_arrival_depart_df.select(vessel_arrival_depart_df['DetectedPortArrival']['poiMeta'])
temp.show(truncate=False)
print type(temp)
I get:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DetectedPortArrival.poiMeta |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[portPoi,5555,BEILUN [CNBEI],marinePort], [portPoi,5729,NINGBO [CNNBO],marinePort], [portPoi,5730,NINGBO PT [CNNBG],marinePort]] |
|null |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Here are the dtypes:
temp.dtypes
[('DetectedPortArrival.poiMeta',
'array<struct<poiCategory:string,poiId:bigint,poiName:string,poiType:string>>')]
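Note that the dot is now part of the column name itself rather than a nesting level, which can be confirmed by listing the top-level columns:

print temp.columns
# ['DetectedPortArrival.poiMeta'] -- one column whose name contains a dot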
But here's the problem: I can't seem to query the column DetectedPortArrival.poiMeta:
df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
df2.show(2)
AnalysisException Traceback (most recent call last)
<ipython-input-46-c7f0041cffe9> in <module>()
----> 1 df2 = temp.selectExpr("DetectedPortArrival.poiMeta")
2 df2.show(3)
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/dataframe.py in selectExpr(self, *expr)
996 if len(expr) == 1 and isinstance(expr[0], list):
997 expr = expr[0]
--> 998 jdf = self._jdf.selectExpr(self._jseq(expr))
999 return DataFrame(jdf, self.sql_ctx)
1000
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/opt/spark/spark-2.1.0-bin-hadoop2.4/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u"cannot resolve '`DetectedPortArrival.poiMeta`' given input columns: [DetectedPortArrival.poiMeta]; line 1 pos 0;\n'Project ['DetectedPortArrival.poiMeta]\n+- Project [DetectedPortArrival#268.poiMeta AS DetectedPortArrival.poiMeta#503]\n +- Project [asOf#263, vesselId#264, DetectedPortArrival#268, DetectedPortDeparture#269]\n +- Sort [asOf#263 ASC NULLS FIRST], true\n +- Project [smfPayloadData#1.paired.shipmentId AS shipmentId#262, smfPayloadData#1.timestamp.asOf AS asOf#263, smfPayloadData#1.paired.vesselId AS vesselId#264, smfPayloadData#1.paired.vesselName AS vesselName#265, smfPayloadData#1.geolocation.speed AS speed#266, smfPayloadData#1.geolocation.detectedPois AS detectedPois#267, smfPayloadData#1.events.DetectedPortArrival AS DetectedPortArrival#268, smfPayloadData#1.events.DetectedPortDeparture AS DetectedPortDeparture#269]\n +- Filter ((((cast(smfPayloadData#1.paired.vesselId as double) = cast(9776183 as double)) && isnotnull(smfPayloadData#1.paired.shipmentId)) && (length(smfPayloadData#1.paired.shipmentId) > 0)) && (isnotnull(smfPayloadData#1.paired.vesselId) && (isnotnull(smfPayloadData#1.events.DetectedPortArrival) || isnotnull(smfPayloadData#1.events.DetectedPortDeparture))))\n +- SubqueryAlias smurf_processed\n +- Relation[smfMetaData#0,smfPayloadData#1,smfTransientData#2] parquet\n"
Any suggestions on how to query this column?
Answer 0 (score: 0)
Could you just select the column by index instead?
temp.select(temp.columns[0]).show()
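One caveat: select(temp.columns[0]) hands Spark the same dotted string, so the analyzer may again parse the dot as struct-field access on a column named DetectedPortArrival, which no longer exists in temp. Quoting the name in backticks forces the parser to treat the whole thing as a single identifier; alternatively, alias the column away from the dotted name when you first extract it. A sketch (untested against your data):

# Backticks make the whole dotted name one identifier:
df2 = temp.selectExpr("`DetectedPortArrival.poiMeta`")
df2.show(2, truncate=False)

# The same quoting works with col():
from pyspark.sql.functions import col, explode
temp.select(col("`DetectedPortArrival.poiMeta`")).show(2)

# Or avoid the dotted name entirely by aliasing at extraction time;
# the array of POI structs can then be exploded and queried field by field:
pois = (vessel_arrival_depart_df
        .select(col('DetectedPortArrival')['poiMeta'].alias('poiMeta'))
        .select(explode('poiMeta').alias('poi')))
pois.select('poi.poiId', 'poi.poiName').show()

Backtick escaping is Spark SQL's standard identifier quoting, so the same trick also works in raw SQL queries against this data.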
Best regards