I'm using Stanford NLP to split text into sentences, but it does not keep contractions together.
Here is an example of a sentence I get back:
List(I, 'd, like, to, fix, this, sentence, because, it, 's, broken)
My goal is to join the contracted words so that the result looks like this:
List(I'd, like, to, fix, this, sentence, because, it's, broken)
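Is there an elegant way to do this in Scala? Basically I'm looking for an expression that walks the list, checks each element against the next one, joins them if they match the condition, and returns the resulting list as in my example.
Answer 0 (score: 2)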
scala> val l = List("I", "'d", "like", "to fix", "this", "sentence", "because", "it", "'s", "broken")
l: List[String] = List(I, 'd, like, to fix, this, sentence, because, it, 's, broken)
scala> l.reduceRight((s1, s2) => if (s2.startsWith("'")) s1 + s2 else s1 + " " + s2).split(" ").toList
res2: List[String] = List(I'd, like, to, fix, this, sentence, because, it's, broken)
Note that this throws an exception if the list is empty (because it uses reduceRight). If that can happen, you may want to use foldRight or reduceRightOption instead.
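A minimal sketch of the empty-safe variant mentioned above, using reduceRightOption. The object and method names (SafeJoin, joinContractions) are made up for illustration, and the sketch still assumes that no token is expected to keep an internal space, because the result is split on " " again.

```scala
object SafeJoin {
  // Hypothetical helper: same join-then-split idea as above,
  // but returns Nil for an empty list instead of throwing.
  def joinContractions(tokens: List[String]): List[String] =
    tokens
      // Attach the accumulated right-hand side directly when it starts with an apostrophe.
      .reduceRightOption((s1, s2) => if (s2.startsWith("'")) s1 + s2 else s1 + " " + s2)
      // Some(joined) is split back into words; None (empty input) becomes Nil.
      .map(_.split(" ").toList)
      .getOrElse(Nil)
}
```

For example, SafeJoin.joinContractions(Nil) returns an empty list rather than raising UnsupportedOperationException.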
Answer 1 (score: 1)
val broken = List("I", "'d", "like", "to", "fix", "this", "sentence", "because", "it", "'s", "broken")
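broken.foldLeft(List.empty[String]) { (list, str) =>
  // the nonEmpty check avoids calling init/last on an empty list
  // when the very first token starts with an apostrophe
  if (str.startsWith("'") && list.nonEmpty) {
    // glue the contraction suffix onto the previous word
    list.init :+ (list.last + str)
  } else {
    list :+ str
  }
}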
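(I'm assuming the "to fix" element in your question was meant to be two elements and the comma was left out by mistake.)
Answer 2 (score: 1)
An approach that extends the accepted answer so that it also handles ca, n't: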
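implicit class StanfordNLPConcat(val words: List[String]) extends AnyVal {
  def SNLPConcat() = {
    // "#" is a temporary separator; this assumes no token contains it
    val sep = "#"
    // attach any token containing an apostrophe (e.g. 'd, 's, n't) to the previous one
    words.reduce { (a, v) => if (v.contains("'")) a + v else a + sep + v }.split(sep).toList
  }
}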
Let
val words = List("I", "'d", "like", "to", "fix", "this", "sentence", "because", "it", "'s", "broken")
and so
words.SNLPConcat()
res: List[String] = List(I'd, like, to, fix, this, sentence, because, it's, broken)
Also,
List("It", "ca", "n't", "be", "wrong").SNLPConcat()
res: List[String] = List(It, can't, be, wrong)
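The join-and-split approaches above rely on a separator string that must not occur inside any token. As a rough sketch of an alternative, the same merging can be done directly on the list with foldLeft, which also works on an empty list. The names ContractionMerge, merge and isSuffix are hypothetical and not part of any answer or of Stanford NLP.

```scala
object ContractionMerge {
  // Hypothetical helper: glues every token for which `isSuffix` holds
  // onto the token that precedes it. The result is built in reverse
  // (prepending is cheap on List) and reversed once at the end.
  def merge(words: List[String])(isSuffix: String => Boolean): List[String] =
    words.foldLeft(List.empty[String]) {
      // There is a previous token and the current one is a contraction part: join them.
      case (prev :: rest, w) if isSuffix(w) => (prev + w) :: rest
      // Otherwise keep the token as-is (this also covers a suffix at the very start).
      case (acc, w) => w :: acc
    }.reverse
}
```

A possible call is ContractionMerge.merge(List("It", "ca", "n't", "be", "wrong"))(_.contains("'")). Because the predicate is only applied to the incoming token, an already merged word such as can't is not glued onto the word before it.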