Using RDDs

Date: 2015-05-21 19:38:00

Tags: python apache-spark

I have an RDD:

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
 u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
 u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

Using this code:

rdd = rdd.groupBy(lambda x: x.split(",")[9])
new_rdds = [sc.parallelize(x[1]) for x in rdd.collect()]

for x in new_rdds:
    print x.collect()

I get:

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']
[u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

Is there a way to get only a specific RDD, for example the one where x[9] == 2014,

so that I get:

[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

2 answers:

Answer 0 (score: 1)

You can filter the input RDD, e.g. rdd.filter(lambda x: x.split(",")[9] == "2014"). Note that split() returns strings, so compare against the string "2014" rather than the integer 2014.
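
A minimal, self-contained sketch of this approach, assuming a local SparkContext named sc and using the sample rows from the question (variable names are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "filter-by-year")

# The sample rows, one comma-separated string per record.
lines = sc.parallelize([
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
])

# Keep only rows whose tenth field (index 9) is the year 2014.
# split() yields strings, so compare against the string "2014".
rows_2014 = lines.filter(lambda line: line.split(",")[9] == "2014")

print(rows_2014.collect())
# expected: [u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']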

Answer 1 (score: 1)

You can use filter() to select specific rows.

With your original RDD:

rdd = rdd.filter(lambda line: line.split(",")[9] == "2014")
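
If you would rather compare numerically (for example, to filter a range of years), one option, assuming the year field is always a valid integer, is to cast it first:

rdd = rdd.filter(lambda line: int(line.split(",")[9]) == 2014)
print(rdd.collect())
# expected: [u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']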