Is it possible to subtract column "A" present in Stream1 from column "B" present in Stream2?

Asked: 2019-04-10 07:52:27

Tags: pyspark apache-kafka pyspark-sql spark-structured-streaming

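I am reading two streams (stream1 and stream2) from Kafka in Spark Structured Streaming (pyspark). I have to compute the difference between the offsets of stream1 and stream2.

I am trying something like this: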

<class 'pyspark.sql.dataframe.DataFrame'>
root
|-- timestamp: timestamp (nullable = true)
|-- value: string (nullable = true)
|-- offset: double (nullable = true)
|-- string_val: string (nullable = true)
|-- ping: double (nullable = true)
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- offset_v1: double (nullable = true)
|-- date_time: string (nullable = true)
|-- date_format: timestamp (nullable = true)

<class 'pyspark.sql.dataframe.DataFrame'>
|-- Mean: double (nullable = true)
|-- pingTime: timestamp (nullable = true)
|-- Std_Deviation: double (nullable = true)
|-- devTime: timestamp (nullable = true)
|-- offset_v2: double (nullable = true)
|-- upperBound: double (nullable = true)
|-- lowerBound: double (nullable = true)

stream2 = stream2.withColumn('difference',stream2.offset_v2-stream1.offset_v1)

It throws an error:

  

pyspark.sql.utils.AnalysisException: u'Resolved attribute(s) offset_v1#95 missing from upperBound#182,Std_Deviation#149,lowerBound#189,Mean#133,pingTime#129-T30000ms,devTime#144-T30000ms,offset_v2#155 in operator !Project [Mean#133, pingTime#129-T30000ms, Std_Deviation#149, devTime#144-T30000ms, offset_v2#155, upperBound#182, lowerBound#189, (offset_v2#155 - offset_v1#95) AS difference#233]

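1 Answer:

Answer 0 (score: 0)

As Venki said, you need to join first so that related rows are lined up for comparison. Do you have a column to join on? A date or an id would do the trick. Assuming both DataFrames have a column called join_col: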


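If you cannot find a suitable join, you run into the problem of comparing columns of different lengths (which I think is what you are dealing with).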