PySpark: how to convert an rdd to a string

Asked: 2018-04-12 20:53:42

Tags: python pyspark

I need to pass coordinates in a URL, and for that I need to convert the rdd to a string with the coordinate pairs separated by semicolons.

all_coord_iso_rdd.take(4)

[(-73.57534790039062, 45.5311393737793),
 (-73.574951171875, 45.529457092285156),
 (-73.5749282836914, 45.52922821044922),
 (-73.57501220703125, 45.52901077270508)]

type(all_coord_iso_rdd)
pyspark.rdd.PipelinedRDD

Expected result:

"-73.57534790039062,45.5311393737793;-73.574951171875,45.529457092285156;
 -73.5749282836914,45.52922821044922;-73.57501220703125,45.52901077270508"

My URL has the following format:

http://127.0.0.1/match/v1/driving/-73.57534790039062,45.5311393737793;-73.574951171875,45.529457092285156;-73.5749282836914,45.52922821044922;-73.57501220703125,45.52901077270508
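For reference, a minimal plain-Python sketch of assembling such a URL from the coordinate tuples (the base path is copied from the question; the coordinate list is the take(4) output shown above, so no SparkContext is needed to try it):

```python
coords = [(-73.57534790039062, 45.5311393737793),
          (-73.574951171875, 45.529457092285156),
          (-73.5749282836914, 45.52922821044922),
          (-73.57501220703125, 45.52901077270508)]

# join each (lon, lat) pair with "," and the pairs with ";"
coord_str = ";".join(",".join(str(c) for c in pair) for pair in coords)
url = "http://127.0.0.1/match/v1/driving/" + coord_str
print(url)
```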

2 answers:

Answer 0 (score: 1)

In the snippet you posted, all_coord_iso_rdd is an rdd in which each row is a tuple(float, float). Calling take(n) returns n records from the rdd:

x = all_coord_iso_rdd.take(4)
print(x)
#[(-73.57534790039062, 45.5311393737793),
# (-73.574951171875, 45.529457092285156),
# (-73.5749282836914, 45.52922821044922),
# (-73.57501220703125, 45.52901077270508)]

The returned value is just a list of tuples of floats. To convert it into the desired format, we can use str.join inside a list comprehension.

First you need to convert each float to str using map(str, ...), then we can join the values within each tuple with ",":

print([",".join(map(str, item)) for item in x])
#['-73.5753479004,45.5311393738',
# '-73.5749511719,45.5294570923',
# '-73.5749282837,45.5292282104',
# '-73.575012207,45.5290107727']

(The shortened digits in this output come from Python 2's str(float); under Python 3, str keeps the full precision shown in the input.)

Finally, join the resulting list with ";" to get the desired output:

print(";".join([",".join(map(str, item)) for item in x]))
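Putting the steps of this answer together into one runnable snippet (using a plain list in place of the take(4) result, so no Spark is needed to try it):

```python
x = [(-73.57534790039062, 45.5311393737793),
     (-73.574951171875, 45.529457092285156),
     (-73.5749282836914, 45.52922821044922),
     (-73.57501220703125, 45.52901077270508)]

# step 1: stringify each float and join the values of each tuple with ","
pairs = [",".join(map(str, item)) for item in x]
# step 2: join the per-tuple strings with ";"
result = ";".join(pairs)
print(result)
```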

Answer 1 (score: 1)

Here is a pure Spark way (potentially useful for larger rdds / different use cases):

# use a name other than "list" to avoid shadowing the builtin
coords = [(-73.57534790039062, 45.5311393737793), (-73.574951171875, 45.529457092285156),
          (-73.5749282836914, 45.52922821044922), (-73.57501220703125, 45.52901077270508)]

rdd = sc.parallelize(coords)
# join each tuple's values with "," per row, then fold the rows together with ";"
rdd.map(lambda row: ",".join([str(elt) for elt in row]))\
   .reduce(lambda x, y: ";".join([x, y]))
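The same map/reduce chain can be sketched locally, with functools.reduce standing in for rdd.reduce (an illustration only: note that for a deterministic result on a real multi-partition RDD, the function passed to reduce should be commutative as well as associative, and joining with ";" is not commutative, so segment order is only guaranteed in this single-list stand-in):

```python
from functools import reduce

coords = [(-73.57534790039062, 45.5311393737793),
          (-73.574951171875, 45.529457092285156),
          (-73.5749282836914, 45.52922821044922),
          (-73.57501220703125, 45.52901077270508)]

# local stand-in for rdd.map(...): one "lon,lat" string per row
mapped = [",".join(str(elt) for elt in row) for row in coords]
# local stand-in for rdd.reduce(...): fold the row strings together with ";"
result = reduce(lambda x, y: ";".join([x, y]), mapped)
print(result)
```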