I'm trying to find the average length of words that start with each letter of the alphabet, except z. So far I have:
// words only
val words1 = words.map(_.toLowerCase).filter(x => x.length > 0).filter(x => x(0).isLetter)
val allWords = words1.filter(x => !x.startsWith("z")) // avoiding the z
var mapAllWords = allWords.map(x => (x, x.length))    // paired each word with its length
Now what I want to do is get something like ((a, (2, 3, 4, ...)), (b, (2, 4, 5, ..., 9)), ...)
and then compute the average length for every letter.
I'm new to Scala programming.
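To make the target shape concrete, here is a minimal plain-Scala sketch (toy data and hypothetical names, not the Spark solution itself) of grouping word lengths by first letter and averaging them:

// Toy example on plain Scala collections (assumption: names `sample`, `byLetter`, `averages` are illustrative)
val sample = List("apple", "ant", "bee", "bear", "cat")
val byLetter: Map[Char, List[Int]] =
  sample.groupBy(_.head).map { case (c, ws) => c -> ws.map(_.length) }
// byLetter: Map(a -> List(5, 3), b -> List(3, 4), c -> List(3))
val averages: Map[Char, Double] =
  byLetter.map { case (c, ls) => c -> ls.sum.toDouble / ls.size }
// averages: Map(a -> 4.0, b -> 3.5, c -> 3.0)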
Answer 0 (score: 1)
Let's say this is your data:
val words = sc.textFile("README.md").flatMap(_.split("\\s+"))
Convert it to a Dataset:
import spark.implicits._  // encoders for createDataset and the $ column syntax
val ds = spark.createDataset(words)
Filter and aggregate:
import org.apache.spark.sql.functions.{lower, substring, length}

ds
  // Get first letter and length
  .select(
    lower(substring($"value", 0, 1)) as "letter",
    length($"value") as "length")
  // Remove non-letters and z
  .where($"letter".rlike("^[a-y]"))
  // Compute average length
  .groupBy("letter")
  .avg()
  .show
// +------+------------------+
// |letter| avg(length)|
// +------+------------------+
// | l| 7.333333333333333|
// | g|13.846153846153847|
// | m| 9.0|
// | f|3.8181818181818183|
// | n| 3.0|
// | v| 25.4|
// | e| 7.6|
// | o|3.3461538461538463|
// | h| 6.1875|
// | p| 9.0|
// | d| 9.55|
// | y| 3.3|
// | w| 4.0|
// | c| 6.56|
// | u| 4.416666666666667|
// | i| 4.774193548387097|
// | j| 5.0|
// | b| 5.352941176470588|
// | a|3.5526315789473686|
// | r| 4.6|
// +------+------------------+
// only showing top 20 rows
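If you prefer an explicit column name instead of avg(length), a minor variation of the same pipeline (not part of the original answer; assumes the same imports plus avg) would be:

import org.apache.spark.sql.functions.avg

ds
  .select(lower(substring($"value", 0, 1)) as "letter", length($"value") as "length")
  .where($"letter".rlike("^[a-y]"))
  .groupBy("letter")
  .agg(avg($"length") as "avg_length")
  .show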
Answer 1 (score: 0)
(Without Spark) A few hints:
val l = List("mario", "monica", "renzo", "sabrina", "sonia", "nikola", "enrica", "paola")
val couples = l.map(w => (w.charAt(0), w.length))
couples.groupBy(_._1)
  .map(x => (x._1, (x._2, x._2.size)))
You get:
l: List[String] = List(mario, monica, renzo, sabrina, sonia, nikola, enrica, paola)
couples: List[(Char, Int)] = List((m,5), (m,6), (r,5), (s,7), (s,5), (n,6), (e,6), (p,5))
res0: scala.collection.immutable.Map[Char,(List[(Char, Int)], Int)] = Map(e -> (List((e,6)),1), s -> (List((s,7), (s,5)),2), n -> (List((n,6)),1), m -> (List((m,5), (m,6)),2), p -> (List((p,5)),1), r -> (List((r,5)),1))
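To go from those hints to the averages the question asks for, a sketch extending the same couples list (not part of the original answer) could be:

// Average length per starting letter
val averages: Map[Char, Double] =
  couples
    .groupBy(_._1)  // Map[Char, List[(Char, Int)]]
    .map { case (c, ps) => c -> ps.map(_._2).sum.toDouble / ps.size }
// averages: Map(e -> 6.0, s -> 6.0, n -> 6.0, m -> 5.5, p -> 5.0, r -> 5.0)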
Answer 2 (score: 0)
Here is a Scala example that computes the average length of all words starting with the same letter; I think you can easily adapt it to your use case.
val sentences = Array("Lester is nice", "Lester is cool", "cool Lester is an awesome dude", "awesome awesome awesome Les")
val sentRDD = sc.parallelize(sentences)
val gbRDD = sentRDD
  .flatMap(line => line.split(' '))
  .map(word => (word(0), word.length))
  .groupByKey(2)
gbRDD.map(wordKVP => (wordKVP._1, wordKVP._2.sum / wordKVP._2.size.toDouble)).collect()
It returns the following...
Array((d,4.0), (L,5.25), (n,4.0), (a,6.0), (i,2.0), (c,4.0))
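To adapt it to the original question (lowercase, letters only, no words starting with z), one possible variant of the same RDD pipeline (an assumption about where the filters would go, not part of the original answer) is:

val avgRDD = sentRDD
  .flatMap(_.split(' '))
  .map(_.toLowerCase)
  .filter(w => w.nonEmpty && w(0).isLetter && !w.startsWith("z"))
  .map(w => (w(0), w.length))
  .groupByKey(2)
  .map { case (letter, lengths) => (letter, lengths.sum / lengths.size.toDouble) }
avgRDD.collect()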
Or with PySpark, if you prefer...
sentences = ['Lester is nice', 'Lester is cool', 'cool Lester is an awesome dude', 'awesome awesome awesome Les']
sentRDD = sc.parallelize(sentences)
gbRDD = sentRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word[0], len(word))).groupByKey(2)
gbRDD.map(lambda wordKVP: (wordKVP[0], sum(wordKVP[1])/len(wordKVP[1]))).collect()
Same result...
[('L', 5.25), ('i', 2.0), ('c', 4.0), ('d', 4.0), ('n', 4.0), ('a', 6.0)]