I want to filter out the alphanumeric and numeric strings from my file. I am using spark-shell. These are the contents of my file sparktest.txt:
This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?
Define the file to be collected:
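The code for this step is not shown on the page; a minimal sketch, assuming the path is simply sparktest.txt (the variable name myLines comes from the later snippets):
scala> val myLines = sc.textFile("sparktest.txt")   // read the file into an RDD[String]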
Save the lines to an array of words with length greater than 2:
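Only the description survives here; a plausible sketch, assuming a whitespace split (the exact delimiter is a guess) and the myWords name used below:
scala> val myWords = myLines.flatMap(line => line.split(" ")).filter(word => word.length > 2)   // keep words longer than 2 characters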
Define the regex to use. I only want strings that match "[A-Za-z]+":
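The definition itself is missing, but given the error message below it was almost certainly a scala.util.matching.Regex built with .r:
scala> val regexpr = "[A-Za-z]+".r   // letters only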
Attempt to filter out the alphanumeric and numeric strings:
scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
       val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
This is where I am stuck. I want the result to look like this:
Array[String] = Array(This, file, not, would, you, this, file, HDFS)
Answer 0 (score: 2)
You can actually do this in a single transformation by filtering the split array inside the flatMap:
val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
When I run this in spark-shell, I see:
scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21
scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23
scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
Answer 1 (score: 1)
You can use filter(x => regexpr.pattern.matcher(x).matches)
or filter(_.matches("[A-Za-z]+")).
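For example, plugged into the pipeline from the question (reusing the myWords RDD defined there), either form would look roughly like this:
scala> val myOnlyWords = myWords.filter(_.matches("[A-Za-z]+"))
scala> myOnlyWords.collect
The original attempt failed because a scala.util.matching.Regex cannot be applied to a string like a function; you either go through its underlying java.util.regex.Pattern or call matches on the string itself.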