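I have a file containing a large number of documents. How can I skip the lines whose length (in tokens) is <= 2 and then process the lines whose length is > 2? For example: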
fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .
After skipping those lines:
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
My code:
val Bi = text.map(sen=> sen.split(" ").sliding(2))
Is there a way to do this?
Answer 0 (score: 2)
Use flatMap, returning None for the lines you want to skip:
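text.flatMap(line => {
  // Tokenize on single spaces
  val tokenized = line.split(" ")
  // Keep the bigrams only for lines with more than 2 tokens; None is dropped by flatMap
  if (tokenized.length > 2) Some(tokenized.sliding(2))
  else None
})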
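To see how the Option-based flatMap drops short lines, here is a minimal sketch that uses a plain Scala Seq in place of the RDD (an assumption made so it runs without Spark). The object name and the `bigramsPerLine` value are made up for illustration. Since `sliding` returns an Iterator, the sketch converts each window to a List so the result can be printed and compared.

```scala
// Plain Scala collections stand in for the Spark RDD here (illustrative assumption).
object FlatMapSkipShortLines {
  val lines: Seq[String] = Seq(
    "fit perfectly clie .",
    "purchased not",
    "product."
  )

  // Some(bigrams) for lines with more than 2 tokens, None otherwise;
  // flatMap discards the None results, so short lines disappear.
  val bigramsPerLine: Seq[List[List[String]]] = lines.flatMap { line =>
    val tokenized = line.split(" ")
    if (tokenized.length > 2) Some(tokenized.sliding(2).map(_.toList).toList)
    else None
  }

  def main(args: Array[String]): Unit =
    bigramsPerLine.foreach(println)
}
```

Only the first sample line has more than 2 tokens, so the result holds a single entry with three bigrams.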
Answer 1 (score: 2)
I would use filter:
> val text = sc.parallelize(Array("fit perfectly clie .",
"purchased not",
"instructions install helpful . improvement battery life not hoped .",
"product.",
"cable good not work . cable extremely hot not recognize devices ."))
> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
From here, you can process your data in its original (i.e. untokenized) form after filtering. If you would rather tokenize first, you can do this:
text.map{_.split(" ")}.filter{_.size > 2}
So, finally, to tokenize, then filter, then find bigrams using sliding, you can use:
text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
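The tokenize, filter, sliding chain can also be tried without a Spark context. The sketch below is one possible version on a plain Seq. The `bigrams` helper and its `minTokens` parameter are hypothetical names, and unlike the snippets above it trims each line and splits on runs of whitespace (an assumption, so that repeated spaces do not produce empty tokens).

```scala
object BigramPipeline {
  // Hypothetical helper: tokenize, keep lines with more than `minTokens` tokens,
  // then turn each remaining line into its list of (first, second) bigrams.
  def bigrams(lines: Seq[String], minTokens: Int = 2): Seq[List[(String, String)]] =
    lines
      .map(_.trim.split("\\s+"))        // assumption: split on whitespace runs
      .filter(_.length > minTokens)     // skip lines that are too short
      .map(tokens => tokens.sliding(2).collect { case Array(a, b) => (a, b) }.toList)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "fit perfectly clie .",
      "purchased not",
      "product.",
      "cable good not work . cable extremely hot not recognize devices ."
    )
    // One output line per kept input line
    bigrams(sample).foreach(line => println(line.mkString(" ")))
  }
}
```

Matching on `Array(a, b)` inside `collect` keeps only full two-element windows, so the helper stays safe even if `minTokens` is lowered to 0.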