Question

I need to transform an RDD that has two rows into an RDD that has a single row. For example:

open FSharp.Data

type ls = JsonProvider<"""
  [{"sex":"male","height":180,"weight":85},
   {"sex":"male","height":160,"weight":60},
   {"sex":"male","height":180,"weight":85}]""">

let dt = ls.GetSamples()

let newJson = 
  dt
  |> Array.map (fun recd ->
      // To do the calculation, you can access the fields via inferred types 
      let bmi = float recd.Height / float recd.Weight

      // But now we need to look at the underlying value, check that it is
      // a record and extract the properties, which is an array of key-value pairs
      match recd.JsonValue with
      | JsonValue.Record props ->
          // Append the new property to the existing properties & re-create record
          Array.append [| "bmi", JsonValue.Float bmi |] props
          |> JsonValue.Record
      | _ -> failwith "Unexpected format" )

// Re-create a new JSON array and format it as JSON
JsonValue.Array(newJson).ToString()

I need:

rdd1=a
     b

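How can I do this step in PySpark? The question may be a silly one, but I am new to Spark.

UPDATE: The goal is to compute a Cartesian product between rdd2 and rdd3, where rdd2 is built from rdd1. For example: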

rdd2=(a,b)

I want this output:

rdd3:(k,l)
     (c,g)
     (f,x)

Thanks in advance.

Answer 1

Updating my answer:

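from pyspark.streaming import StreamingContext

# Assumes an existing SparkContext `sc` (for example, from the pyspark shell).
# Collect the small RDD to the driver as a plain list of (char, code) pairs.
initRDD = sc.parallelize(list('aeiou')).map(lambda x: (x, ord(x))).collect()

ssc = StreamingContext(sc, batchDuration=3)

lines = ssc.socketTextStream('localhost', 9999)
items = lines.flatMap(lambda x: x.split())
# Prepend each (item, count) pair to the collected list.
counts = items.countByValue().map(lambda x: ([x] + initRDD))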

This looks more like a broadcast than a Cartesian product.
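
To make the broadcast idea concrete, here is a minimal sketch. It assumes the goal is to pair the single collected row with every row of the other RDD; that output shape is my reading of the question, not something it states. The function names are hypothetical. `attach_row_local` models the logic in plain Python so it runs without a cluster, and `attach_row_spark` shows the same thing on RDDs, assuming a live SparkContext `sc`.

```python
def attach_row_local(small_rows, big_rows):
    """Plain-Python model: pair one collected row with every row of big_rows."""
    one_row = tuple(small_rows)
    return [(one_row, row) for row in big_rows]


def attach_row_spark(sc, small_rdd, big_rdd):
    """Same idea on RDDs. Needs a live SparkContext `sc` (an assumption here)."""
    # Collect the small RDD to the driver and ship it to the executors once.
    one_row = sc.broadcast(tuple(small_rdd.collect()))
    return big_rdd.map(lambda row: (one_row.value, row))


print(attach_row_local(['a', 'b'], [('k', 'l'), ('c', 'g'), ('f', 'x')]))
```

This only makes sense when the collected RDD is small enough to fit in driver memory; for two large RDDs, `cartesian` would be the operation to look at instead.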

Answer 2

Can you explain your requirement? Using a single-row RDD is not a good idea, because you lose all parallelism.

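If you want to collect data by key, you can convert the RDD into a pair RDD (key and value). Then you can call reduceByKey with list concatenation as the reduce function, which gathers all the values for each key into one list.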
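
A minimal sketch of that approach follows; the function names are hypothetical. `collect_by_key_local` models the result in plain Python so it can be run anywhere, while `collect_by_key_spark` is what it might look like on a pair RDD, assuming each value is first wrapped in a one-element list so that concatenation works.

```python
def collect_by_key_local(pairs):
    """Plain-Python model: gather all values for each key into a list."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = grouped.get(key, []) + [value]
    return grouped


def collect_by_key_spark(pair_rdd):
    """Same idea on a pair RDD of (key, value) tuples."""
    # Wrap each value in a list, then concatenate the lists per key.
    return pair_rdd.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)


print(collect_by_key_local([('m', 1), ('f', 2), ('m', 3)]))
```

Note that on a real cluster the order of values inside each list is not guaranteed, since it depends on how the partitions are combined.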

Answer 3

If I understand your question correctly, you can get the desired output with flatMap.

Generating a single row from multiple rows in an RDD
