Question

I'm having trouble using regular expressions. My sample data is:

12 13 hello hiiii hhhhh

this doesnt have numeric so should be removed
Even this line should be excluded
`12` this line contains numeric shouldn't exclude
Hope even this line should be excluded

scala> val pattern = "[a-z][A-Z]".r

pattern: scala.util.matching.Regex = [a-z][A-Z]

scala> val b = a.filter(line => !line.startsWith(pattern))
<console>:31: error: type mismatch;
 found   : scala.util.matching.Regex
 required: String
       val b = a.filter(line => !line.startsWith(pattern))
                                                 ^

Or if I use:

scala> val b = a.filter(line => !line.startsWith("[a-z][A-Z]".r)).take(3)

<console>:29: error: type mismatch;
 found   : scala.util.matching.Regex
 required: String
       val b = a.filter(line => !line.startsWith("[a-z][A-Z]".r)).take(3)
                                                              ^

I'm actually not sure how to use regular expressions in Spark. Please help me.

Answer 1

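Your regular expression matches only a single lowercase letter followed by a single uppercase letter, e.g. aA, bA, rF. So it will not discard any element of your list.

So you may want to change it to this: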

[a-zA-Z]*

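This matches any word made up only of letters (lowercase and uppercase).

As for the matching problem, you are using the wrong method: startsWith expects a String, not a Regex, which is what the type mismatch error is telling you. The proper way to test a string against a regex is: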
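To make the difference between the two patterns concrete, here is a minimal sketch that tests a few strings against each one. The names PatternCheck and fullMatch are made up for illustration; only the two regexes come from this thread.

```scala
import scala.util.matching.Regex

object PatternCheck {
  // The pattern from the question and the suggested replacement.
  val original: Regex = "[a-z][A-Z]".r
  val lettersOnly: Regex = "[a-zA-Z]*".r

  // Hypothetical helper: true only when the WHOLE string matches the regex.
  def fullMatch(r: Regex, s: String): Boolean = r.pattern.matcher(s).matches

  def main(args: Array[String]): Unit = {
    println(fullMatch(original, "aA"))             // true
    println(fullMatch(original, "hello"))          // false
    println(fullMatch(lettersOnly, "hello"))       // true
    println(fullMatch(lettersOnly, "12"))          // false
    println(fullMatch(lettersOnly, ""))            // true: * also accepts the empty string
    println(fullMatch(lettersOnly, "hello hiiii")) // false: the space is not a letter
  }
}
```

Note the last case: since a space is not a letter, the pattern describes single words, so it will not match a whole line that contains several words.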

val pattern = """[a-zA-Z]*""".r

val filtered = rdd.filter(line => !pattern.pattern.matcher(line).matches)

Here is the output:

scala> filtered.foreach(println)
12
13
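The output above is per word. If the goal is instead to keep whole lines that contain a number, as the sample data in the question suggests, something along these lines might work. This is only a sketch: it assumes "numeric" means "contains at least one ASCII digit", it uses a local Seq in place of the RDD so it runs without Spark, and the names KeepNumericLines and hasDigit are invented for the example.

```scala
import scala.util.matching.Regex

object KeepNumericLines {
  // Assumption: a line counts as "numeric" if it contains at least one ASCII digit.
  val digit: Regex = "[0-9]".r

  // findFirstIn returns Some(match) when the regex occurs anywhere in the line.
  def hasDigit(line: String): Boolean = digit.findFirstIn(line).isDefined

  // Sample lines from the question; a local Seq stands in for the RDD here.
  val lines: Seq[String] = Seq(
    "12 13 hello hiiii hhhhh",
    "this doesnt have numeric so should be removed",
    "Even this line should be excluded",
    "`12` this line contains numeric shouldn't exclude",
    "Hope even this line should be excluded"
  )

  // Keeps only the first and fourth lines.
  val kept: Seq[String] = lines.filter(hasDigit)

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```

On an RDD[String] the same predicate should carry over unchanged, e.g. rdd.filter(line => KeepNumericLines.hasDigit(line)).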

For details, see the API documentation for scala.util.matching.Regex.

