Question

I want to filter out alphanumeric and numeric strings from my file. I am using spark-shell. These are the contents of my file sparktest.txt:

This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?

I define the file for collection (myLines), save the lines into an array of words with length greater than 2 (myWords), and define the regular expression to use (regexpr). I only want strings that match "[A-Za-z]+".

Attempting to filter out the alphanumeric and numeric strings:

scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
       val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)

This is where I am stuck. I want the result to look like this:

Array[String] = Array(This, file, not, would, you, this, file, HDFS)

Answer 1

You can actually do this in a single transformation by filtering the split array inside the flatMap:

val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))

When I run this in spark-shell, I see:

scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21

scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23

scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
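
For anyone who wants to check the split-and-filter logic without a Spark context, here is a minimal sketch on plain Scala collections. The object name WordFilterSketch and the helpers onlyWords and wordsOf are hypothetical names chosen for illustration; only the split and filter expressions come from the answer above.

```scala
// Sketch only: plain Scala collections stand in for the RDD.
// WordFilterSketch, onlyWords and wordsOf are illustrative names, not Spark API.
object WordFilterSketch {
  // Split one line on runs of non-word characters, then keep only tokens
  // that are purely alphabetic and longer than 2 characters.
  def onlyWords(line: String): Array[String] =
    line.split("\\W+").filter(w => w.matches("[A-Za-z]+") && w.length > 2)

  // Local counterpart of the flatMap over lines: one flat sequence of words.
  def wordsOf(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => onlyWords(line).toSeq)
}
```

Calling WordFilterSketch.wordsOf on a Seq containing the sample sentence should give the same eight words as the collect output above.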

Answer 2

You can use filter(x => regexpr.pattern.matcher(x).matches) or filter(_.matches("[A-Za-z]+")).
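
The compiler error in the question arises because scala.util.matching.Regex has no apply method, so regexpr(x) is not a valid call. A rough sketch of both suggested filters follows, using an ordinary Seq in place of the RDD. It assumes regexpr was defined as "[A-Za-z]+".r, which is consistent with the error message but is not shown in the question; RegexFilterSketch, viaPattern and viaString are hypothetical names.

```scala
import scala.util.matching.Regex

// Sketch only: a Seq stands in for the RDD of words.
// RegexFilterSketch, viaPattern and viaString are illustrative names.
object RegexFilterSketch {
  // Assumed definition of the question's regexpr value.
  val regexpr: Regex = "[A-Za-z]+".r

  // Option 1: go through the java.util.regex.Pattern that backs the Regex;
  // matches is true only when the whole string fits the pattern.
  def viaPattern(words: Seq[String]): Seq[String] =
    words.filter(x => regexpr.pattern.matcher(x).matches)

  // Option 2: String.matches, which takes the pattern as a string
  // and also requires a full-string match.
  def viaString(words: Seq[String]): Seq[String] =
    words.filter(_.matches("[A-Za-z]+"))
}
```

Both variants keep the same words. The first reuses the already compiled pattern, while String.matches compiles its pattern argument on each call, which may matter for large inputs.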

How to filter out alphanumeric strings in Scala using a regular expression
