I have an RDD that looks like this:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Is there a way to get three separate RDDs, for example by filtering on the value of the year column?
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0']
and
[u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1']
and
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Answer 0 (score: 3)
Here is one way to do it with groupBy, assuming your original RDD is stored in a variable named rdd:
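# Group the lines by the year field (index 9); each element is (year, iterable of lines).
grouped = rdd.groupBy(lambda x: x.split(",")[9])
# collect() pulls every group to the driver; each group is then turned back into an RDD.
new_rdds = [sc.parallelize(list(x[1])) for x in grouped.collect()]
for x in new_rdds:
    print(x.collect())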
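Since the question asks about filters, here is a minimal sketch of that variant. It assumes the year is always the field at index 9, and the helper names year_of and split_by_year are made up for illustration. Only the distinct years are collected to the driver; the rows themselves stay distributed.

```python
def year_of(line, year_index=9):
    """Return the year field of one comma-separated row, as a string."""
    return line.split(",")[year_index]


def split_by_year(rdd, year_index=9):
    """Return a dict mapping each distinct year to an RDD filtered on that year.

    `rdd` is assumed to be an RDD of comma-separated strings. Only the
    standard RDD methods map, distinct, collect and filter are used.
    """
    years = rdd.map(lambda line: year_of(line, year_index)).distinct().collect()
    # Bind `year` as a default argument so each lambda keeps its own value
    # instead of all lambdas seeing the last year of the loop.
    return {
        year: rdd.filter(lambda line, year=year: year_of(line, year_index) == year)
        for year in years
    }
```

With the data from the question, split_by_year(rdd)[u'2012'].collect() should return the two 2012 rows. Each filtered RDD rescans the source, so calling rdd.cache() first may be worthwhile when there are many distinct years.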
Answer 1 (score: 1)
There is probably a better solution than this, but I learned a lot while working on it and spent so much time that I couldn't resist posting it.
from operator import itemgetter
Working with the strings was confusing for me, so I converted them to ints. The original list a:
In [60]: a
Out[60]:
[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']
Split each string on commas and convert the fields to ints:
In [61]: b = [list(map(int, elem.split(','))) for elem in a]
In [62]: b
Out[62]:
[[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
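Sort b by the year column using itemgetter from the operator module (index -6, i.e. the tenth field). Rows with the same year must be adjacent before itertools.groupby can group them by year:
In [63]: b_s = sorted(b, key=itemgetter(-6))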
In [64]: b_s
Out[64]:
[[1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0]]
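With the rows sorted by year, the grouping step might look like the following sketch. The name groups is made up for illustration, and itemgetter(9) picks the same year column as itemgetter(-6) for these 15-field rows.

```python
from itertools import groupby
from operator import itemgetter

# The rows already sorted by year, as in Out[64].
b_s = [
    [1, 0, 0, 0, 1, 1, 0, 1, 1, 2012, 49, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 0, 1, 2012, 49, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 2013, 52, 0, 4, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1, 1, 2014, 45, 0, 0, 1, 0],
]

# groupby only merges adjacent rows with equal keys, hence the sort above.
# Each group iterator is materialized into a list before moving on.
groups = {year: list(rows) for year, rows in groupby(b_s, key=itemgetter(9))}

for year in sorted(groups):
    print(year, groups[year])
```

Each list in groups could then be passed to sc.parallelize to get one RDD per year, as in the first answer.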