Splitting an RDD with Spark

Asked: 2015-05-12 19:55:14

Tags: python apache-spark

I have an RDD that looks like:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
 u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
 u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

Is there a way to get three separate RDDs, say by filtering on the year column value?

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']

[u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']

[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

2 answers:

Answer 0 (score: 3)

Here is one way using groupBy, assuming your original RDD is bound to the variable name rdd:

# Group records by the year field (index 9 after splitting on commas).
rdd = rdd.groupBy(lambda x: x.split(",")[9])

# Each element of rdd.collect() is a (year, iterable-of-records) pair;
# turn each group's records back into an RDD of their own.
new_rdds = [sc.parallelize(list(x[1])) for x in rdd.collect()]

for x in new_rdds:
    print x.collect()
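The question asks about filters specifically; a minimal sketch of that alternative is below, using a plain Python list to stand in for the RDD so the logic is easy to verify. In Spark, the `set(...)` step would become `rdd.map(year_of).distinct().collect()` and each list comprehension would become `rdd.filter(...)`; `year_of` is a helper name introduced here for illustration.

```python
# The year lives at index 9 of each comma-separated record.
def year_of(record):
    return record.split(",")[9]

rows = [
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
]

# Collect the distinct years, then build one filtered subset per year.
# In Spark: by_year = {y: rdd.filter(lambda r, y=y: year_of(r) == y)
#                      for y in years}
years = sorted(set(year_of(r) for r in rows))
by_year = {y: [r for r in rows if year_of(r) == y] for y in years}
```

Note the `y=y` default argument in the Spark variant: without it, every lambda built in the loop would close over the same loop variable.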

Answer 1 (score: 1)

There's probably a better solution than this, but I learned so much working on it, and spent so much time, that I couldn't resist posting it.


Working with strings was confusing, so I converted them to ints first.

In [60]: a
Out[60]: 
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
 u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
 u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

First, convert each record into a list of ints:

In [61]: b=[map(int,elem.split(',')) for elem in a]

In [62]: b
Out[62]: 
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
 [1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
 [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]

Then sort by the year field (index -6) using itemgetter from the operator module, so that itertools.groupby can group adjacent rows by year:

In [63]: b_s=sorted(b,key=itemgetter(-6))

In [64]: b_s
Out[64]: 
[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
 [1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
 [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
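The transcript stops after the sort, but the intended final step was presumably `itertools.groupby` (note that `groupby` lives in `itertools`, while `itemgetter` comes from `operator`). A sketch of how the grouping would finish, continuing from the sorted list `b_s`:

```python
from itertools import groupby
from operator import itemgetter

# The sorted rows from Out[64] above.
b_s = [
    [1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0],
]

# groupby only merges *adjacent* equal keys, which is why the sort
# by itemgetter(-6) had to come first.
groups = {year: list(rows) for year, rows in groupby(b_s, key=itemgetter(-6))}
```

Each value in `groups` is one of the per-year sublists the question asked for; in Spark each could then be fed back through `sc.parallelize` as in the first answer.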