Extracting the words and values between parentheses in Scala-Spark

Date: 2015-04-25 20:03:39

Tags: scala matrix apache-spark

This is my data:

doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
etc...

I want to extract the words and the values and store them in two separate matrices: matrix_1 holds (docID, words) and matrix_2 holds (docID, values).

1 Answer:

Answer 0: (score: 0)

input.txt
=========
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)

// Read the file as an RDD of lines (sc is the SparkContext, e.g. in spark-shell)
val inputText = sc.textFile("input.txt")

// Split each line into the doc ID and its "(word,value)" tokens,
// then strip the parentheses and split every token on the comma
val digested = inputText.map(line => line.split(":"))
        .map(row => row(0) -> row(1).trim.split(" "))
        .map(row => row._1 -> row._2.map(_.stripPrefix("(").stripSuffix(")").trim.split(",")))

// matrix_1: docID -> words; matrix_2: docID -> values (still strings)
val matrix_1 = digested.map(row => row._1 -> row._2.map( a => a(0)))
val matrix_2 = digested.map(row => row._1 -> row._2.map( a => a(1)))

This gives (each array is shown with its elements joined by commas):

List(
  (doc1 -> Does,just,what,was,needed,to,charge,the,Macbook),
  (doc2 -> Pro,G4,13inch,laptop),
  (doc3 -> Only,beef,was,it,no,longer,lights,up,the)
)

List(
  (doc1 -> 1,-1,0,1,1,0,1,0,1), 
  (doc2 -> 1,-1,0,1), 
  (doc3 -> 1,0,1,0,-1,0,-1,0,-1)
)
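
The answer above leaves the values as strings and relies on tokens being separated by exactly one space. As a rough sketch of an alternative, the same extraction can be done with a regular expression that also converts each value to an Int. This version uses plain Scala collections so it can be tried without a Spark cluster. The names `ParseDocs`, `Pair`, and `parseLine` are hypothetical and not part of the original answer, and the sketch assumes every line contains a colon and every value is an integer.

```scala
// Hypothetical helper, not from the original answer: parses one line such as
// "doc2: (Pro,1) (G4,-1)" into (docID, words, values).
object ParseDocs {
  // Matches "(word,value)" where value is an optionally negative integer
  private val Pair = """\(([^,()]+),(-?\d+)\)""".r

  def parseLine(line: String): (String, Vector[String], Vector[Int]) = {
    // Split on the first colon only; a line without a colon throws a MatchError
    val Array(docId, rest) = line.split(":", 2)
    // Collect every (word, value) pair found after the colon, in order
    val pairs = Pair.findAllMatchIn(rest)
      .map(m => (m.group(1), m.group(2).toInt))
      .toVector
    (docId.trim, pairs.map(_._1), pairs.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)",
      "doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)"
    )
    // Print each document as "docID -> words | values"
    lines.map(parseLine).foreach { case (id, words, values) =>
      println(s"$id -> ${words.mkString(",")} | ${values.mkString(",")}")
    }
  }
}
```

With Spark, the same function could presumably be applied as `inputText.map(ParseDocs.parseLine)`, after which the word and value columns can be selected separately to form matrix_1 and matrix_2.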