Scala regex: split a string into an array of pairs of consecutive words

Date: 2018-04-28 18:35:18

Tags: regex scala

In Scala, I need to split a string into an array whose elements are pairs of consecutive words:

"Hello, it is useless text. Hope you can help me."

Result:

[[it is], [is useless], [useless text], [Hope you], [you can], [can help], [help me]]

Another example:

"This is example 2. Just\nskip it."

Result: [[This is], [is example], [Just skip], [skip it]]

I tried this regular expression:

val re = """[a-zA-Z]+\s[a-zA-Z]+""".r

But the output is:

scala> for (m <- re.findAllIn("Hello, it is useless text. Hope you can help me.")) println(m)
it is
useless text
Hope you
can help

So it misses some of the pairs.
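
The gaps are consistent with how findAllIn works: it returns non-overlapping matches, so a word consumed as the second half of one match cannot also start the next one. One possible workaround, sketched below, is to consume only the first word and capture the second inside a lookahead, which does not advance the match position. The names pairRe, wordPairs, pairs1 and pairs2 are hypothetical, and the pattern assumes that words are runs of ASCII letters separated only by whitespace.

```scala
// Hypothetical helper: the first word is consumed, the second is only
// peeked at through a lookahead, so it can still start the next match.
val pairRe = """\b([a-zA-Z]+)(?=\s+([a-zA-Z]+)\b)""".r

def wordPairs(txt: String): List[(String, String)] =
  pairRe.findAllMatchIn(txt).map(m => (m.group(1), m.group(2))).toList

val pairs1 = wordPairs("Hello, it is useless text. Hope you can help me.")
// pairs1: List((it,is), (is,useless), (useless,text), (Hope,you), (you,can), (can,help), (help,me))

val pairs2 = wordPairs("This is example 2. Just\nskip it.")
// pairs2: List((This,is), (is,example), (Just,skip), (skip,it))
```

Because a comma, a full stop or a digit is neither a letter nor whitespace, the lookahead fails there and no pair is formed across those positions.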

4 answers:

Answer 0 (score: 1)

First split on punctuation and digits, then split each piece on whitespace, then slide over the resulting words.

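def doubleUp(txt: String): Array[Array[String]] =
  txt.split("[.,;:\\d]+")                       // cut into chunks at punctuation and digits
     .flatMap(_.trim.split("\\s+").sliding(2))  // words of each chunk, as overlapping windows of 2
     .filter(_.length > 1)                      // drop the short window left by a one-word chunk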

Usage:

val txt1 = "Hello, it is useless text. Hope you can help me."
doubleUp(txt1)
//res0: Array[Array[String]] = Array(Array(it, is), Array(is, useless), Array(useless, text), Array(Hope, you), Array(you, can), Array(can, help), Array(help, me))

val txt2 = "This is example 2. Just\nskip it."
doubleUp(txt2)
//res1: Array[Array[String]] = Array(Array(This, is), Array(is, example), Array(Just, skip), Array(skip, it))

Answer 1 (score: 1)

First preprocess the string by resolving any escape sequences in it:

scala> val string = "Hello, it is useless text. Hope you can help me."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String = Hello, it is useless text. Hope you can help me.
  

Or:

scala> val string = "This is example 2. Just\nskip it."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String =
//This is example 2. Just
//skip it.
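
In both examples the string literals contain no unprocessed escape sequences, because the compiler has already turned \n into a real newline, so processEscapes returns them unchanged. The call only makes a difference when the text holds a literal backslash followed by n, for instance after being read from a file. A small sketch of that case, where the value names raw, cooked, rawHasNewline and cookedHasNewline are hypothetical:

```scala
// Triple-quoted literals are not escape-processed by the compiler,
// so raw holds a backslash followed by 'n', not a newline.
val raw = """This is example 2. Just\nskip it."""
val cooked = StringContext.processEscapes(raw)

// cooked contains a real newline where raw had the two characters \ and n
val rawHasNewline = raw.contains('\n')        // false
val cookedHasNewline = cooked.contains('\n')  // true
```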

Then filter out the unwanted tokens (empty strings, words with trailing punctuation, and so on) and use the sliding function:

val result = preprocessed.split("\\s").filter(e => !e.isEmpty && !e.matches("(?<=^|\\s)[A-Za-z]+\\p{Punct}(?=\\s|$)") ).sliding(2).toList

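// result: List[Array[String]] = List(Array(it, is), Array(is, useless), Array(useless, Hope), Array(Hope, you), Array(you, can), Array(can, help))

Note that this drops every word with punctuation attached (Hello, text and me) and pairs useless with Hope across the sentence boundary, so it differs from the output requested in the question.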

Answer 2 (score: 0)

You need to use split to break the string into words separated by non-word characters, and then sliding to pair the words up the way you want:

val text = "Hello, it is useless text. Hope you can help me."

text.trim.split("\\W+").sliding(2)

You may also want to handle escape characters, as described in the other answer.
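
Since sliding returns an Iterator, the expression needs to be materialized before it can be printed or reused. A minimal sketch, with the hypothetical name allPairs; note that splitting on \W+ discards the punctuation entirely, so the result also contains pairs that span a comma or a sentence boundary, unlike the output requested in the question.

```scala
val text = "Hello, it is useless text. Hope you can help me."

// Join each two-word window into one string and force the iterator into a List
val allPairs: List[String] =
  text.trim.split("\\W+").sliding(2).map(_.mkString(" ")).toList

// allPairs: List(Hello it, it is, is useless, useless text, text Hope,
//                Hope you, you can, can help, help me)
```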

Answer 3 (score: -1)

Sorry, I only know Python. I have heard the two languages are fairly similar, so I hope you can follow this:

string = "it is useless text. Hope you can help me."
split = string.split(' ')  # splits on space (you can use regex for this)
result = []
no = 0
count = len(split)
for x in range(count):
    no += 1
    if no < count:
        pair = split[x] + ' ' + split[no]  # joins the current word to the next one
        result.append(pair)

The output will be:

['it is', 'is useless', 'useless text.', 'text. Hope', 'Hope you', 'you can', 'can help', 'help me.']
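
For comparison, roughly the same pairing can be written in Scala, the language the question asks about, with sliding. This is a sketch that mirrors the Python above rather than the requested output: it splits on single spaces only, so punctuation stays attached to the words. The name spacePairs is hypothetical.

```scala
val string = "it is useless text. Hope you can help me."

// Split on single spaces, then join each overlapping window of two tokens
val spacePairs: List[String] =
  string.split(" ").sliding(2).map(_.mkString(" ")).toList

// spacePairs: List(it is, is useless, useless text., text. Hope,
//                  Hope you, you can, can help, help me.)
```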