Question

I have a file like the following:

0; best wrap ear market pair pair break make

1; time sennheiser product better earphone fit 

1; recommend headphone pretty decent full sound earbud design 

0; originally buy work gym work well robust sound quality good clip 

1; terrific sound great fit toss mine profuse sweater headphone

0; negative experienced sit chair back touch chair earplug displace hurt
...

I want to extract the number at the start of each line and store it with every word of that document. I have tried:

  var grouped_with_wt = data.flatMap({ (line) =>
    val words = line.split(";").split(" ")
    words.map(w => {
      val a = 
      (line.hashCode(),(vocab_lookup.value(w), a))
    })
  }).groupByKey()

The expected output is:

(1453543,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(break,0),(make,0))
(3942334,(time,1),(sennheiser,1),(product,1),(better,1),(earphone,1),(fit,1))
...

Once the results above are generated, I use them in this code to produce the final result:

   val Beta = DenseMatrix.zeros[Int](V, S)
      val Beta_c = grouped_with_wt.flatMap(kv => {
        kv._2.map(wt => {
          Beta(wt._1,wt._2) +=1
        })
      })

Final result:

This code does not work well. Can anyone help me? I would like code that produces output like the above.

Answer 1

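// Read the input; each line looks like "0; best wrap ear ..."
val inputRDD = sc.textFile("input dir ")
val outRDD = inputRDD.map(r => {
    // Split the label (before ';') from the text (after ';')
    val tuple = r.split(";")
    val key = tuple(0)
    val words = tuple(1).trim().split(" ")
    // Pair every word with the label of its line
    val outArr = words.map(w => (w, key))
    // Key each line by its hash code and join the pairs with commas
    (r.hashCode, outArr.mkString(","))
})
outRDD.saveAsTextFile("output dir")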

Output

(-1704185638,(best,0),(wrap,0),(ear,0),(market,0),(pair,0),(pair,0),(break,0),(make,0))
(147969209,(time,5),(sennheiser,5),(product,5),(better,5),(earphone,5),(fit,5))
(1145947974,(recommend,1),(headphone,1),(pretty,1),(decent,1),(full,1),(sound,1),(earbud,1),(design,1))
(838871770,(originally,4),(buy,4),(work,4),(gym,4),(work,4),(well,4),(robust,4),(sound,4),(quality,4),(good,4),(clip,4))
(934228708,(terrific,5),(sound,5),(great,5),(fit,5),(toss,5),(mine,5),(profuse,5),(sweater,5),(headphone,5))
(659513416,(negative,-3),(experienced,-3),(sit,-3),(chair,-3),(back,-3),(touch,-3),(chair,-3),(earplug,-3),(displace,-3),(hurt,-3))
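
The answer above joins the pairs into a single string, while the second snippet in the question iterates over `(word, label)` pairs. As a rough sketch of one way to keep the pairs structured, the parsing can be written as a plain Scala function. The names `LabelledLines`, `parseLine` and `countPairs` are hypothetical and not from the original answer, and the sketch assumes the label is an integer and that lines without a `label; text` shape (blank lines, `...`) should be skipped.

```scala
// Hypothetical helper, not part of the original answer.
object LabelledLines {

  // Parses a line such as "0; best wrap ear" into
  // Some((line hash, Seq((word, label)))). Lines that do not have the
  // shape "<integer>; <text>" yield None (an assumption of this sketch).
  def parseLine(line: String): Option[(Int, Seq[(String, Int)])] =
    line.split(";", 2) match {
      case Array(labelPart, text) if labelPart.trim.matches("-?\\d+") =>
        val label = labelPart.trim.toInt
        // Split on runs of whitespace and drop empty tokens.
        val words = text.trim.split("\\s+").filter(_.nonEmpty).toSeq
        Some((line.hashCode, words.map(w => (w, label))))
      case _ => None
    }

  // Counts how often each (word, label) pair occurs across parsed lines.
  // On plain collections this stands in for a distributed aggregation.
  def countPairs(parsed: Seq[(Int, Seq[(String, Int)])]): Map[(String, Int), Int] =
    parsed.flatMap(_._2).groupBy(identity).map { case (pair, occ) => pair -> occ.size }
}
```

With Spark, the same function could presumably be applied as `sc.textFile(path).flatMap(r => LabelledLines.parseLine(r))`, and the counts obtained with `.flatMap(_._2).map(p => (p, 1)).reduceByKey(_ + _)` instead of incrementing `Beta` inside a transformation. Updates made to a driver-side variable from within a transformation run on the executors' copies and generally do not reach the driver, so aggregating first and filling the matrix from the collected counts is likely the safer route.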

Extract numbers and store them in variables in Scala and Spark
