Question

import glob
import pandas as pd
import matplotlib.pyplot as plt

files = glob.glob('*.csv')  # file pattern, something like '*.csv'

for file in files:
    df1=pd.read_csv(file,header=1,sep=',')
    fig = plt.figure()
    plt.subplot(2, 1, 1)
    plt.plot(df1.iloc[:,[1]],df1.iloc[:,[2]])

    plt.subplot(2, 1, 2)
    plt.plot(df1.iloc[:,[3]],df1.iloc[:,[4]])
    plt.show()  # this will block the loop until you close the plot window

Here I perform a sum, but is it possible to do a count operation inside reduceByKey?

Like what I have in mind:

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
  .reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))

Any suggestions are welcome.

Answer 1

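No, you can't do that. RDDs expose an iterator model for lazy computation, so each element is visited only once.

If you really want to compute it the way you describe, first repartition your RDD, then use mapPartitions and implement the computation inside the closure (keep in mind that the elements of an RDD are not ordered).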
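A minimal sketch of the repartition-then-mapPartitions approach described above. The helper name countAndSumPerKey, the input shape RDD[((Short, String), Double)], and the choice of HashPartitioner are assumptions made for illustration; they are not part of the original question.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical helper: per-key (count, sum) computed inside mapPartitions.
def countAndSumPerKey(
    records: RDD[((Short, String), Double)],
    numPartitions: Int
): RDD[((Short, String), (Long, Double))] =
  records
    // Put every record that shares a key into the same partition.
    .partitionBy(new HashPartitioner(numPartitions))
    .mapPartitions({ iter =>
      // The partition's iterator is consumed exactly once, in no particular order.
      val acc = mutable.Map.empty[(Short, String), (Long, Double)]
      iter.foreach { case (key, value) =>
        val (n, total) = acc.getOrElse(key, (0L, 0.0))
        acc.update(key, (n + 1L, total + value))
      }
      acc.iterator
    }, preservesPartitioning = true)
```

Because all records for a key land in one partition, the per-partition map already holds the final result for that key. The map is kept in memory, so this only suits partitions with a manageable number of distinct keys.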

Answer 2

Well, counting is equivalent to summing 1s, so just map the first item of each value tuple to 1 and sum both parts of the tuple as you did before:

val temp1 = tempTransform.map { temp => 
   ((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
}
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))

The result will be an RDD[((Short, String), (Int, Double))], where the first item of the value tuple (the Int) is the number of original records that matched that key.
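If both of the original sums are still needed next to the count, the same idea extends to a three-element value tuple. This is only a sketch: the Reading case class and the countAndSums helper are hypothetical stand-ins for the rows of tempTransform, whose real type is not shown in the question.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record type standing in for the rows of tempTransform.
final case class Reading(id: Short, name: String, first: Double, second: Double)

// Per key: (record count, sum of `first`, sum of `second`).
def countAndSums(
    readings: RDD[Reading]
): RDD[((Short, String), (Long, Double, Double))] =
  readings
    // Emit 1L per record so that summing the first slot yields the count.
    .map(r => ((r.id, r.name), (1L, r.first, r.second)))
    // Add the tuples component by component.
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))
```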

This is actually the classic map-reduce example: word count.
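For comparison, word count itself can be sketched as follows. The helper name wordCount and the whitespace-based tokenization are assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: count how often each whitespace-separated word occurs.
def wordCount(lines: RDD[String]): RDD[(String, Int)] =
  lines
    .flatMap(_.split("\\s+"))   // split each line into words
    .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
    .map(word => (word, 1))     // one "1" per occurrence
    .reduceByKey(_ + _)         // summing the 1s gives the count per word
```

The shape is the same as in the answer above: map each record to a key paired with 1, then let reduceByKey add the ones up.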

Count operation in reduceByKey in Spark

2 Answers: