I have an RDD:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Using this code:
rdd = rdd.groupBy(lambda x: x.split(",")[9])
new_rdds = [sc.parallelize(x[1]) for x in rdd.collect()]
for x in new_rdds:
    print(x.collect())
I get:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0'],
[u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']
[ u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Is there a way to get only one specific RDD, for example the one where x[9] = 2014,
so that I get:
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Answer 0 (score: 1)
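You can filter the input RDD directly, e.g. rdd.filter(lambda x: x.split(",")[9] == "2014"). Note that split() returns strings, so the field has to be compared with the string "2014"; comparing it with the integer 2014 never matches.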
Answer 1 (score: 1)
You can use filter() to select specific rows.
Starting from your original RDD (before the groupBy):
rdd = rdd.filter(lambda line: line.split(",")[9] == "2014")
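Because split(",") yields strings, the year field has to be compared with a string (or converted with int() first). Here is a minimal sketch of such a predicate, checked on a plain Python list so it runs without Spark. The helper name has_year and its index parameter are made up for illustration and are not part of the Spark API.

```python
# Sample rows copied from the question.
rows = [
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
]


def has_year(line, year, index=9):
    """Return True if the comma-separated field at `index` equals `year`.

    Hypothetical helper: the field is a string after split(), so `year`
    is converted to a string before comparing.
    """
    return line.split(",")[index] == str(year)


# Local check with a list comprehension instead of an RDD.
matches = [line for line in rows if has_year(line, 2014)]

# With a live SparkContext the same predicate would be passed to filter():
#   rdd.filter(lambda line: has_year(line, 2014)).collect()
```

This also avoids the groupBy / collect / parallelize round trip from the question, which pulls every group back to the driver just to keep one of them.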