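I am reading two streams (stream1 and stream2) from Kafka in Spark Structured Streaming (PySpark). I need to compute the difference between the offsets of stream1 and stream2.
I am trying something like this (the schemas of stream1 and stream2 are shown first):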
<class 'pyspark.sql.dataframe.DataFrame'>
root
|-- timestamp: timestamp (nullable = true)
|-- value: string (nullable = true)
|-- offset: double (nullable = true)
|-- string_val: string (nullable = true)
|-- ping: double (nullable = true)
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- offset_v1: double (nullable = true)
|-- date_time: string (nullable = true)
|-- date_format: timestamp (nullable = true)
<class 'pyspark.sql.dataframe.DataFrame'>
root
|-- Mean: double (nullable = true)
|-- pingTime: timestamp (nullable = true)
|-- Std_Deviation: double (nullable = true)
|-- devTime: timestamp (nullable = true)
|-- offset_v2: double (nullable = true)
|-- upperBound: double (nullable = true)
|-- lowerBound: double (nullable = true)
stream2 = stream2.withColumn('difference',stream2.offset_v2-stream1.offset_v1)
It throws an error:
pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) offset_v1#95 missing from upperBound#182,Std_Deviation#149,lowerBound#189,Mean#133,pingTime#129-T30000ms,devTime#144-T30000ms,offset_v2#155 in operator !Project [Mean#133, pingTime#129-T30000ms, Std_Deviation#149, devTime#144-T30000ms, offset_v2#155, upperBound#182, lowerBound#189, (offset_v2#155 - offset_v1#95) AS difference#233]
Answer 0 (score: 0)
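As Venki said, you need to join the two DataFrames first so that related rows can be compared. The error occurs because offset_v1 belongs to stream1, so it cannot be resolved inside a withColumn call on stream2. Do you have a column to join on? A date or an id would do. Suppose both DataFrames have a column named join_col: join on it, then compute the difference on the joined result.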
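If you cannot find a suitable join key, you will run into trouble, because you would be comparing columns of different lengths (which I think is what you are dealing with).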