Question

I am sending data to Spark (Python, PySpark) over a TCP socket.

I use a windowed stream with windowLength = 4 seconds and slideInterval = 2 seconds.
The RDD for one window looks like this:

[1,2,3,4]    
[2,2,2,2]    
[5,6,7,8]    
[1,2,1,1]    
[8,7,6,5]

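How can I find the mean, median, max, standard deviation, and IQR of the corresponding (column-wise) values? For example:
mean = [(1 + 2 + 5 + 1 + 8) / 5, (2 + 2 + 6 + 2 + 7) / 5, (3 + 2 + 7 + 1 + 6) / 5, (4 + 2 + 8 + 1 + 5) / 5]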
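For reference, the formula above can be evaluated in plain Python, without Spark, to get the values a correct solution should produce. This is only a sketch; the variable names `rows` and `expected_mean` are made up for illustration.

```python
# Rows of one window, copied from the example above.
rows = [
    [1, 2, 3, 4],
    [2, 2, 2, 2],
    [5, 6, 7, 8],
    [1, 2, 1, 1],
    [8, 7, 6, 5],
]

# zip(*rows) transposes the rows into columns, so each `col`
# holds the corresponding elements of every row.
expected_mean = [sum(col) / len(col) for col in zip(*rows)]
print(expected_mean)  # [3.4, 3.8, 3.8, 4.0]
```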

My code so far:

import numpy as np
from pyspark import SparkContext
from pyspark.streaming import StreamingContext


def importData():
    sc = SparkContext(appName="test1")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval
    RowsData = ssc.socketTextStream("localhost", 9999)
    RowsData = RowsData.map(lambda x: x.split(","))
    RowsDataLIST = RowsData.map(lambda mylist: [int(strTono) for strTono in mylist])
    print("Print the Rows Data List with windows 4,2")

    # windowLength = 4 s, slideInterval = 2 s
    RowsDataLIST = RowsDataLIST.window(4, 2)
    # Averages rows two at a time, so the result is not the mean of all rows
    TheMean = RowsDataLIST.reduce(lambda x, y: list(map(np.mean, zip(x, y))))

    TheMean.pprint()

    ssc.start()
    ssc.awaitTermination()


def main():
    importData()


if __name__ == "__main__":
    main()

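The output for the mean is [1.0, 0.75, 4.0, 2.25], which is clearly wrong. I understand that .reduce(lambda x, y: ...) takes two rows at a time and averages them. But what approach should I use if I need the mean of all corresponding elements across the RDD within a window?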
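The problem with averaging two rows at a time can be reproduced without Spark. The sketch below folds the five example rows left to right with a pairwise mean and compares the result with the true column-wise mean. The helper name `pairwise_mean` is hypothetical, and in Spark the exact wrong value would depend on how the rows are grouped across partitions, so the numbers here are not expected to match the output reported above.

```python
from functools import reduce

rows = [
    [1, 2, 3, 4],
    [2, 2, 2, 2],
    [5, 6, 7, 8],
    [1, 2, 1, 1],
    [8, 7, 6, 5],
]


def pairwise_mean(x, y):
    # Same idea as the lambda passed to .reduce(): average two rows element-wise.
    return [(a + b) / 2 for a, b in zip(x, y)]


# Each fold step halves the weight of everything seen so far,
# so later rows count far more than earlier ones.
folded = reduce(pairwise_mean, rows)
true_mean = [sum(col) / len(col) for col in zip(*rows)]

print(folded)     # [5.0625, 5.0, 4.4375, 4.125]
print(true_mean)  # [3.4, 3.8, 3.8, 4.0]
```

A pairwise mean is not associative, which is why it cannot serve as the reduce function for an overall mean.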

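One approach I could take is to compute the sum and divide by the count, but I would like to know whether there is a different way.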
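A minimal sketch of the sum-divided-by-count idea, written in plain Python so it can be run without a cluster. Each row is paired with a count of 1, and the pairs are combined with a function that is associative and commutative, which is what a reduce step needs. The names `add_pairs`, `pairs`, `total`, and `count` are assumptions made for this illustration.

```python
from functools import reduce

rows = [
    [1, 2, 3, 4],
    [2, 2, 2, 2],
    [5, 6, 7, 8],
    [1, 2, 1, 1],
    [8, 7, 6, 5],
]

# Pair each row with a count of 1: (column sums so far, rows seen so far).
pairs = [(row, 1) for row in rows]


def add_pairs(p, q):
    # Element-wise sum of the rows plus the sum of the counts.
    # The grouping order does not change the result.
    sums = [a + b for a, b in zip(p[0], q[0])]
    return sums, p[1] + q[1]


total, count = reduce(add_pairs, pairs)
mean = [s / count for s in total]
print(mean)  # [3.4, 3.8, 3.8, 4.0]
```

On the windowed DStream, the same shape would presumably be `RowsDataLIST.map(lambda row: (row, 1)).reduce(add_pairs).map(lambda p: [s / p[1] for s in p[0]])`; that line has not been run against a live stream here.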
Also, how can I compute the other statistics over the corresponding elements of the lists?
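For the remaining statistics, one possible sketch is to gather the rows of a window and compute everything per column with Python's standard `statistics` module (Python 3.8 or later). Median and IQR need all values of a column at once, so they do not fit a simple pairwise reduce. The function name `column_stats` is hypothetical; the standard deviation here is the population one (divide by N), and the quartiles use linear interpolation via `method="inclusive"`. Both are choices made for this example, not requirements stated in the question.

```python
import statistics

rows = [
    [1, 2, 3, 4],
    [2, 2, 2, 2],
    [5, 6, 7, 8],
    [1, 2, 1, 1],
    [8, 7, 6, 5],
]


def column_stats(rows):
    """Return one dict of statistics per column of equal-length rows."""
    stats = []
    for col in zip(*rows):  # transpose rows into columns
        # Quartiles with linear interpolation between data points.
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        stats.append({
            "mean": statistics.mean(col),
            "median": statistics.median(col),
            "max": max(col),
            "std": statistics.pstdev(col),  # population std (divide by N)
            "iqr": q3 - q1,
        })
    return stats


for s in column_stats(rows):
    print(s)
```

With the windowed stream, this could be hooked in with something like `RowsDataLIST.foreachRDD(lambda rdd: print(column_stats(rdd.collect())))`, assuming each window is small enough to collect on the driver.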

I am new to this, so any guidance is appreciated.

Spark Streaming reduceByWindow: need mean, median, max, std, IQR

0 answers: