Comparing documents and removing duplicates in Spark and Scala

Asked: 2015-06-05 11:49:10

Tags: scala apache-spark

Suppose I have these documents and I want to remove the duplicates:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
oldest daughter teen daughter player christmas so daughter life line listen sooo hold

This is the expected output:

buy sansa view sell product player charger world charge player charger receive 
oldest daughter teen daughter player christmas so daughter life line listen sooo hold
thourghly sansa view delete song time wont wont connect-computer computer put time 

Is there a solution for this in Scala and Spark?

2 Answers:

Answer 0 (score: 1)

It looks like you are reading the file line by line, so textFile will correctly read it into an RDD of strings, one element per line. After that, distinct will reduce the RDD to a unique set of lines.
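
A minimal sketch of what this answer describes, assuming the input lives at a placeholder HDFS path and that sc is an existing SparkContext:

val lines = sc.textFile("hdfs://...")   // one RDD element per input line
val unique = lines.distinct()           // drop duplicate lines
unique.saveAsTextFile("hdfs://...")     // write the deduplicated lines back out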

Answer 1 (score: 0)

You can achieve what you want with the reduceByKey function.

You can use this code:

val textFile = spark.textFile("hdfs://...")   // `spark` is assumed to be a SparkContext
val uLine = textFile.map(line => (line, 1))   // pair each line with a count of 1
                    .reduceByKey(_ + _)       // merge duplicate lines, summing the counts
                    .map(pair => pair._1)     // keep only the line text itself
uLine.saveAsTextFile("hdfs://...")

Or you can simply use:

val uLine = spark.textFile("hdfs://...").distinct
uLine.saveAsTextFile("hdfs://...")
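
For what it's worth, the RDD distinct transformation is itself implemented internally with a similar map/reduceByKey pattern, so the two snippets above end up doing essentially the same shuffle; the distinct version is just shorter to write.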