Filtering and mapping a Spark RDD using Python

Asked: 2016-01-12 12:24:55

Tags: python apache-spark rdd

I have a list containing the following data:

cm_id    pr_id    pr_val
625456,  123456,  90.0
625456,  123457,  89.0
625457,  123356,  98.0
625457,  123457,  9.0

I pass this list to a function prod_ret(input_list).

How can I map on the first position (the cm_id, e.g. 625456) and filter on pr_id == '123456' or pr_id == '123457', so that I get an RDD of the matching rows?

So it should look like:

625456,'prod_ret',90.0,89.0 ==> prod_ret is a fixed string
625457,'prod_ret',98.0,9.0
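One way to reach that intermediate shape is to group the parsed rows by cm_id. A sketch in plain Python (on an RDD, groupByKey would play the same role); the rows below are the question's sample data, already split and typed:

```python
# Sample rows as (cm_id, pr_id, pr_val) tuples, parsed from the input.
rows = [
    ("625456", "123456", 90.0),
    ("625456", "123457", 89.0),
    ("625457", "123356", 98.0),
    ("625457", "123457", 9.0),
]

# Group the pr_vals under their cm_id, preserving input order.
grouped = {}
for cm_id, pr_id, pr_val in rows:
    grouped.setdefault(cm_id, []).append(pr_val)

# Build (cm_id, 'prod_ret', a, b) records, one per cm_id.
intermediate = [(cm_id, "prod_ret", vals[0], vals[1])
                for cm_id, vals in sorted(grouped.items())]
```

This yields `("625456", "prod_ret", 90.0, 89.0)` and `("625457", "prod_ret", 98.0, 9.0)`, matching the shape shown above.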

And I want to apply the formula **a/(a+b)*100** to the above RDD: for 625456, a=90.0 and b=89.0, and for 625457, a=98.0 and b=9.0.
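A quick arithmetic check in plain Python: with a=90.0 and b=89.0 the formula as written gives about 50.28, while the sample results 198.88 and 109.18 shown in the question actually match the inverted form (a+b)/a*100, so the formula may be stated upside down:

```python
def stated(a, b):
    # The formula exactly as written in the question.
    return a / (a + b) * 100

def inverted(a, b):
    # The inverted form, which reproduces the question's sample results.
    return (a + b) / a * 100

print(round(stated(90.0, 89.0), 2))    # 50.28
print(round(inverted(90.0, 89.0), 2))  # 198.89 (question shows 198.88)
print(round(inverted(98.0, 9.0), 2))   # 109.18
```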

And the final RDD, which I need to save, will look like:

625456,prod_ret,198.88

625457,prod_ret,109.18

To get this, I need to write a function like:
    def prod_ret(input_list):
        ## map: split each line once, then build (cm_id, pr_id, pr_val);
        ## the original split only field 0 and indexed characters of the string.
        ## filter: `p[1] == '123456' && p[1] == '123457'` is not valid Python,
        ## and no pr_id can equal both values -- use `in` (logical or) instead.
        filter_list = input_list.map(lambda x: x.split(",")) \
                                .map(lambda x: (x[0].strip(), x[1].strip(), float(x[2]))) \
                                .filter(lambda p: p[1] in ('123456', '123457'))
        ## then get an output RDD like 625456,'prod_ret',90.0,89.0 for cm_id 625456
        ## then apply the formula a/(a+b)*100 on this RDD, where a=90.0, b=89.0
        ## save the RDD with output 625456,prod_ret,198.88 for cm_id 625456
        ## save the RDD with output 625457,prod_ret,109.18 for cm_id 625457
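An end-to-end sketch of the pipeline in plain Python, with list operations standing in for the RDD steps (on Spark, roughly `input_list.map(parse).groupByKey().mapValues(compute)` followed by `saveAsTextFile`). Two assumptions are made to stay consistent with the sample output: each cm_id's two values are taken in input order as a and b rather than filtered by a fixed pr_id list (the question's filter '123456'/'123457' would drop 625457's 98.0 row), and the formula used is the inverted form (a+b)/a*100, which reproduces 198.88/109.18 where the stated a/(a+b)*100 does not:

```python
# Raw input lines, as an RDD of text lines would hold them.
lines = [
    "625456, 123456, 90.0",
    "625456, 123457, 89.0",
    "625457, 123356, 98.0",
    "625457, 123457, 9.0",
]

def prod_ret(input_lines):
    # map step: split each CSV line into (cm_id, pr_val).
    pairs = []
    for line in input_lines:
        fields = [f.strip() for f in line.split(",")]
        pairs.append((fields[0], float(fields[2])))

    # group step (groupByKey on Spark): collect values per cm_id,
    # preserving input order so the first value is a, the second is b.
    grouped = {}
    for cm_id, val in pairs:
        grouped.setdefault(cm_id, []).append(val)

    # compute step: (a + b) / a * 100 -- the form matching the sample output.
    out = []
    for cm_id, (a, b) in sorted(grouped.items()):
        out.append("%s,prod_ret,%.2f" % (cm_id, (a + b) / a * 100))
    return out

for row in prod_ret(lines):
    print(row)
# 625456,prod_ret,198.89
# 625457,prod_ret,109.18
```

The first result rounds to 198.89 rather than the question's 198.88, which looks like truncation in the original post.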

0 Answers:

No answers yet