String replace in a Spark RDD

Date: 2016-04-03 18:09:40

Tags: python pyspark

Let me first explain the problem with code.

numPartitions = 2
rawData1 = sc.textFile('train_new.csv', numPartitions, use_unicode=False)


rawData1.take(1)

['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,Class_2']

Now I want to replace Class_2 with 2.

After the replacement the result should be

['1,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,0,0,0,0,0,0,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,9,0,0,0,0,0,2']

Once this works for one line, I will apply the same operation to the whole dataset.

Thanks in advance, Aashish

1 Answer:

Answer 0 (score: 0)

rawData2 = rawData1.map(lambda x: ','.join(x.split(',')[:-1] + ['2']))

should do it. It works by mapping each element of the RDD through the lambda function, returning a new dataset. The lambda splits the element into an array on the ',' delimiter, slices it to omit the last element, appends the extra element '2', then joins the array back together with ','.

More elaborate constructions are possible by modifying the lambda function appropriately.
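For example, since the asker wants to run this over the whole dataset, the hard-coded '2' can be generalized to handle any class label. The sketch below assumes every label has the form "Class_&lt;digits&gt;" (as in the sample row) and extracts the trailing number; the helper name `replace_label` is my own, not from the original answer.

```python
def replace_label(line):
    """Replace a trailing 'Class_N' field with just 'N'.

    Assumes the line is comma-separated and its last field looks
    like 'Class_<digits>', e.g. 'Class_2' -> '2'.
    """
    fields = line.split(',')
    # Split the last field on '_' and keep the part after it.
    fields[-1] = fields[-1].split('_')[-1]
    return ','.join(fields)

# Applied across the RDD from the question (not runnable without Spark):
# rawData2 = rawData1.map(replace_label)

print(replace_label('1,0,5,Class_2'))  # -> 1,0,5,2
print(replace_label('0,3,0,Class_7'))  # -> 0,3,0,7
```

Using a named function instead of an inline lambda keeps the transformation testable on a single line before mapping it over the full RDD.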