(u'142578', (u'The-North-side-9890', (u' 12457896', 45.0)))
(u'124578', (u'The-West-side-9091', (u' 14578217', 0.0)))
This is what I get from joining two RDDs on their IDs with a Spark join; each record has the shape (key, (value_left, value_right)).
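For reference, here is a minimal sketch of the kind of join that produces this shape (the RDD names and literals here are illustrative, not my actual job):

# Hypothetical sketch: joining two pair RDDs keyed by ID yields
# (key, (value_left, value_right)) records like the ones above.
left = sc.parallelize([(u'142578', u'The-North-side-9890')])
right = sc.parallelize([(u'142578', (u' 12457896', 45.0))])
joined = left.join(right)
# [(u'142578', (u'The-North-side-9890', (u' 12457896', 45.0)))]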
So I want the output to look like this:
The-North-side-9890,12457896,45.0
The-West-side-9091,14578217,0.0
For that I tried the following code:

from pyspark import SparkContext

sc = SparkContext("local", "info")

# Read the joined output back as text and split each line on commas.
file1 = sc.textFile('/home/hduser/join/part-00000').map(lambda line: line.split(','))

# Rebuild "name, id" as the key, parse the number (dropping the trailing ")))"),
# and sum the values per key.
result = file1.map(lambda x: (x[1] + ', ' + x[2], float(x[3][:-3]))).reduceByKey(lambda a, b: a + b)

# Join key and value into one comma-separated string.
result = result.map(lambda x: x[0] + ',' + str(x[1]))

# Strip leftover brackets and parentheses, then write a single output file.
result = result.map(lambda x: x.lstrip('[(').rstrip(')]')).coalesce(1).saveAsTextFile("hdfs://localhost:9000/finalop")
But it gives me the following output:
(u'The-North-side-9896', (u' 12457896',0.0
(u'The-East-side-9876', (u' 47125479',0.0
How can I clean this up to get the output I want? Please help me achieve this.
Answer 0 (score: 3)
Try this:
import collections

def rdd2string(t):
    def rdd2StringHelper(x):
        s = ''
        # Recurse into containers, but never into strings, so the recursion
        # does not descend into them character by character.
        if isinstance(x, collections.Iterable) and not isinstance(x, basestring):
            for elem in x:
                s += rdd2StringHelper(elem)
            return s
        else:
            return str(x) + ','
    # Drop the trailing comma left by the last element.
    return rdd2StringHelper(t)[:-1]

yourRDD.map(lambda x: rdd2string(x)).saveAsTextFile(...)
This function works for all kinds of tuples that can be formed by any combination of tuples (tuple2, tuple3, tuple21, etc.) and lists (lists of lists, lists of tuples, lists of ints, etc.), and outputs a flat representation of the record as a CSV-formatted string.
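For example, applied to the first record from the question (plain Python, no Spark needed to try it):

# Quick local check of rdd2string on the first joined record.
print rdd2string((u'142578', (u'The-North-side-9890', (u' 12457896', 45.0))))
# 142578,The-North-side-9890, 12457896,45.0

Note that the key (142578) is included, since the function flattens the whole record; map to x[1] first if you only want the value part.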
It also answers your question from How to remove unwanted stuff like (),[], single quotes from PyPpark output [duplicate].

EDIT:

Don't forget to add this import:

import collections
Answer 1 (score: 1)
To go from this:
(u'142578', (u'The-North-side-9890', (u' 12457896', 45.0)))
to this:
The-North-side-9890,12457896,45.0
you need to use:
result = result.map(lambda (k, (s, (n1, n2))): ','.join([s, str(int(n1)), str(float(n2))]))
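Note that the tuple parameter in that lambda is Python 2 only; tuple parameter unpacking was removed in Python 3 (PEP 3113). A sketch of the same mapping for Python 3, indexing into the record instead:

# Python 3 version: unpack (k, (s, (n1, n2))) by index, since lambdas
# can no longer destructure tuple parameters.
result = result.map(lambda x: ','.join([x[1][0], str(int(x[1][1][0])), str(float(x[1][1][1]))]))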