Skipping lines based on their length in Scala and Spark

Time: 2015-06-05 15:27:06

Tags: scala apache-spark

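I have a file containing a large number of documents, one per line. How can I skip the lines whose length (number of words) is <= 2 and then process only the lines whose length is > 2? For example: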

fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .

After skipping the short lines:

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

My code:

 val Bi = text.map(sen=> sen.split(" ").sliding(2))

Is there a way to do this?

2 answers:

Answer 0: (score: 2)

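How about flatMap? Returning None for a short line drops it from the result, while Some(...) keeps the bigrams of a longer line:
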
text.flatMap(line=>{
  val tokenized = line.split(" ")
  if(tokenized.length > 2) Some(tokenized.sliding(2))
  else None
})
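
The same idea can be tried without a cluster. The following is a minimal sketch, assuming a plain Scala `Seq` stands in for the RDD `text`. The names `lines` and `bigrams` are illustrative, and the `.map(_.mkString(" ")).toList` step is an addition made here only so the result is printable, because `sliding` returns an iterator of arrays.

```scala
// Plain Scala collection standing in for the RDD (an assumption for illustration).
val lines = Seq(
  "fit perfectly clie .",
  "purchased not",
  "instructions install helpful . improvement battery life not hoped .",
  "product.",
  "cable good not work . cable extremely hot not recognize devices ."
)

// Lines with two or fewer tokens yield None and disappear from the result;
// longer lines yield Some(list of bigrams joined by a space).
val bigrams: Seq[List[String]] = lines.flatMap { line =>
  val tokenized = line.split(" ")
  if (tokenized.length > 2) Some(tokenized.sliding(2).map(_.mkString(" ")).toList)
  else None
}

bigrams.foreach(println)
```

Only three of the five input lines survive, and the first surviving entry holds the bigrams "fit perfectly", "perfectly clie" and "clie .".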

Answer 1: (score: 2)

I would use filter:

> val text = sc.parallelize(Array("fit perfectly clie .",
                                "purchased not",
                                "instructions install helpful . improvement battery life not hoped .",
                                "product.",
                                "cable good not work . cable extremely hot not recognize devices ."))

> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

From here, you can process your data in its original (i.e. untokenized) form after filtering. If you would rather tokenize first, you can do this:

text.map{_.split(" ")}.filter{_.size > 2}

So, finally, to tokenize, then filter, then find bigrams with sliding, you can use:

text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
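
One caveat about `split(" ")`: a run of two spaces produces an empty token, so a two-word line can be counted as three tokens and slip through the filter. The sketch below assumes the input may contain irregular whitespace, which the question does not state. `tokenize` is a hypothetical helper introduced here, not part of either answer, and a plain `Seq` again stands in for the RDD.

```scala
// Hypothetical helper: split on runs of whitespace and drop empty tokens.
def tokenize(line: String): Array[String] =
  line.trim.split("\\s+").filter(_.nonEmpty)

val messy = "purchased  not" // note the double space

// split(" ") keeps the empty string between the two spaces.
val naive = messy.split(" ")
// tokenize sees only the two real words.
val robust = tokenize(messy)

println(naive.length)
println(robust.length)

// Same tokenize -> filter -> sliding pipeline as above, on a plain Seq.
val sample = Seq("fit perfectly clie .", messy, "product.")
val kept = sample
  .map(tokenize)
  .filter(_.length > 2)
  .map(_.sliding(2).map(_.mkString(" ")).toList)

kept.foreach(println)
```

With the helper in place, the double-spaced line is treated as two tokens and is skipped along with "product.".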