Spark: get the average length of all words by starting letter

Time: 2018-02-02 20:00:56

Tags: scala apache-spark

I am trying to find the average length of the words that start with each letter of the alphabet, except z. So far I have:

// words only
val words1 = words.map(_.toLowerCase).filter(x => x.length>0).filter(x => x(0).isLetter)

val allWords = words1.filter(x=> !x.startsWith("z"))// avoiding the z
var mapAllWords= allWords.map(x=> ((x), (x.length)))//mapped it by length.

Now I want to build something like ((a,(2,3,4,...)), (b,(2,4,5,...,9)), ...) and then get the average word length for each letter. I am new to Scala programming.

3 answers:

Answer 0 (score: 1)

Let's say this is your data:

val words = sc.textFile("README.md").flatMap(_.split("\\s+"))

Convert it to a Dataset:

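// Already in scope in spark-shell; needed in a standalone application
import spark.implicits._
import org.apache.spark.sql.functions.{length, lower, substring}

val ds = spark.createDataset(words)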

Filter and aggregate:

ds
  // Get first letter and length
  .select(
    lower(substring($"value", 1, 1)) as "letter", length($"value") as "length") // substring positions are 1-based
  // Remove non-letters and z
  .where($"letter".rlike("^[a-y]"))
  // Compute average length 
  .groupBy("letter")
  .avg()
  .show
// +------+------------------+
// |letter|       avg(length)|
// +------+------------------+
// |     l| 7.333333333333333|
// |     g|13.846153846153847|
// |     m|               9.0|
// |     f|3.8181818181818183|
// |     n|               3.0|
// |     v|              25.4|
// |     e|               7.6|
// |     o|3.3461538461538463|
// |     h|            6.1875|
// |     p|               9.0|
// |     d|              9.55|
// |     y|               3.3|
// |     w|               4.0|
// |     c|              6.56|
// |     u| 4.416666666666667|
// |     i| 4.774193548387097|
// |     j|               5.0|
// |     b| 5.352941176470588|
// |     a|3.5526315789473686|
// |     r|               4.6|
// +------+------------------+
// only showing top 20 rows

Answer 1 (score: 0)

In plain Scala (without Spark), some hints:

val l=List("mario","monica", "renzo","sabrina","sonia","nikola", "enrica","paola")

val couples = l.map(w => (w.charAt(0), w.length))

couples.groupBy(_._1)
       .map(x=> ( x._1, (x._2, x._2.size)))

You get:

l: List[String] = List(mario, monica, renzo, sabrina, sonia, nikola, enrica, paola)

couples: List[(Char, Int)] = List((m,5), (m,6), (r,5), (s,7), (s,5), (n,6), (e,6), (p,5))

res0: scala.collection.immutable.Map[Char,(List[(Char, Int)], Int)] = Map(e -> (List((e,6)),1), s -> (List((s,7), (s,5)),2), n -> (List((n,6)),1), m -> (List((m,5), (m,6)),2), p -> (List((p,5)),1), r -> (List((r,5)),1))
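
The hints above stop at the grouping step. As a minimal sketch of how they might be finished (the object and method names here are hypothetical, not part of the original answer), the grouped lengths can be reduced to one average per first letter:

```scala
object AverageByFirstLetter {
  // Average word length per first letter, using plain Scala collections.
  def averageLengths(words: List[String]): Map[Char, Double] =
    words
      .filter(_.nonEmpty)                  // charAt(0) would fail on empty strings
      .map(w => (w.charAt(0), w.length))   // same (letter, length) pairs as `couples`
      .groupBy(_._1)
      .map { case (letter, pairs) =>
        val lengths = pairs.map(_._2)
        (letter, lengths.sum.toDouble / lengths.size)
      }
}
```

Calling `AverageByFirstLetter.averageLengths(l)` on the list above should give one entry per starting letter, for example 5.5 for `m` (mario, monica) and 6.0 for `s` (sabrina, sonia).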

Answer 2 (score: 0)

Here is a Scala example that gets the average length of all words starting with the same letter; I think you can easily adapt it to your use case.

val sentences = Array("Lester is nice", "Lester is cool", "cool Lester is an awesome dude", "awesome awesome awesome Les")
val sentRDD = sc.parallelize(sentences)

val gbRDD = sentRDD.flatMap(line => line.split(' ')).map(word => (word(0), word.length)).groupByKey(2)

gbRDD.map(wordKVP => (wordKVP._1, wordKVP._2.sum/wordKVP._2.size.toDouble)).collect()

It returns the following:

Array((d,4.0), (L,5.25), (n,4.0), (a,6.0), (i,2.0), (c,4.0))

If you prefer, you can use PySpark:

sentences = ['Lester is nice', 'Lester is cool', 'cool Lester is an awesome dude', 'awesome awesome awesome Les']
sentRDD = sc.parallelize(sentences)

gbRDD = sentRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word[0], len(word))).groupByKey(2)

gbRDD.map(lambda wordKVP: (wordKVP[0], sum(wordKVP[1])/len(wordKVP[1]))).collect()

Same result:

[('L', 5.25), ('i', 2.0), ('c', 4.0), ('d', 4.0), ('n', 4.0), ('a', 6.0)]
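
To adapt this to the question (lowercase everything, keep only words that start with a letter, skip z), one possible sketch in plain Scala follows. It is not the answerer's code: the names `LetterAverages` and `averageByLetter` are made up for illustration, and it keeps a running (sum, count) pair per letter instead of grouping all lengths first. On an RDD, pairs of that shape could be merged with `reduceByKey` rather than `groupByKey`, which is an assumption about how you would port it, not something shown here.

```scala
object LetterAverages {
  // Average word length per lowercase first letter, excluding words that start with 'z'.
  def averageByLetter(lines: Seq[String]): Map[Char, Double] = {
    val totals = lines
      .flatMap(_.split("\\s+"))
      .map(_.toLowerCase)
      .filter(w => w.nonEmpty && w.head.isLetter && w.head != 'z')
      // Accumulate (sum of lengths, number of words) for each first letter.
      .foldLeft(Map.empty[Char, (Int, Int)]) { (acc, w) =>
        val (sum, count) = acc.getOrElse(w.head, (0, 0))
        acc.updated(w.head, (sum + w.length, count + 1))
      }
    totals.map { case (letter, (sum, count)) => (letter, sum.toDouble / count) }
  }
}
```

Because the words are lowercased first, `Lester` and `Les` are counted under `l` rather than `L`, so for the sentences above the average for `l` is still 5.25.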