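I have an input data file in HDFS. I read the file and run some validation on it, as shown below, and get the result shown below. I want to change the delimiter from a comma to a tab ('\t') using pyspark and store the output in HDFS. Can anyone help me with this? (Please, no csv answers.) Thanks in advance.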
Validation Code:
dc = data_f.filter("age > 25").filter(data_f.mar == '"married"').groupBy("job","edu").avg("bal","age").sort(data_f.job.desc(),"edu").rdd.map(list).collect()
Result:
[[u'"unknown"', u'"primary"', 1515.974358974359, 48.61538461538461],
[u'"unknown"', u'"secondary"', 1314.2045454545455, 47.84090909090909],
[u'"unknown"', u'"tertiary"', 2328.64, 51.84],
[u'"unknown"', u'"unknown"', 1977.1157894736841, 51.694736842105264],
[u'"unemployed"', u'"primary"', 1685.6097560975609, 44.957317073170735],
[u'"unemployed"', u'"secondary"', 1472.3518072289157, 43.8433734939759],
[u'"unemployed"', u'"tertiary"', 1865.968992248062, 41.031007751937985],
[u'"unemployed"', u'"unknown"', 859.1875, 45.375],
[u'"technician"', u'"primary"', 1512.704, 47.912]]
Answer 0 (score: 0)
If you need to avoid the write.csv method, you can use this snippet on the rdd:
def concatenate_row(row):
    # Join the columns with tabs; str() converts the numeric averages.
    # Using join avoids leaving a trailing tab at the end of each line.
    return "\t".join(str(col) for col in row)

result = rdd.map(concatenate_row)
Then just call the saveAsTextFile method on result, passing the HDFS output path.
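Putting the two steps together might look like the sketch below. It assumes the rows stay in an RDD, which means dropping the .collect() call from the question's query, because collect() returns a plain Python list on the driver and that list has no saveAsTextFile method. The helper names to_tsv_line and save_as_tsv and the output path are hypothetical, chosen only for illustration.

```python
def to_tsv_line(row):
    """Join one row's columns into a single tab-separated line."""
    return "\t".join(str(col) for col in row)


def save_as_tsv(rdd, output_path):
    """Write each row of `rdd` as one tab-separated line under `output_path`.

    `rdd` is assumed to behave like a pyspark RDD of lists or tuples.
    """
    rdd.map(to_tsv_line).saveAsTextFile(output_path)


# Hypothetical usage with the query from the question (no .collect()):
#   rows = (data_f.filter("age > 25")
#                 .filter(data_f.mar == '"married"')
#                 .groupBy("job", "edu").avg("bal", "age")
#                 .sort(data_f.job.desc(), "edu")
#                 .rdd.map(list))
#   save_as_tsv(rows, "hdfs:///path/to/output")  # placeholder path
```

saveAsTextFile writes a directory of part files rather than a single file, and it fails if the output directory already exists, so pick a fresh path for each run.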