Mapping from String to Tuple2&lt;String, Long&gt; in Spark + Java

Time: 2018-10-26 03:28:56

Tags: java apache-spark

I am trying to learn how to use Spark, writing the code in Java (please, no Scala code). I am trying to implement the very simple hello world example of Spark: word count.

I have borrowed the code from Spark's quick start documentation:

/* SimpleApp.java */
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
    Dataset<String> logData = spark.read().textFile(logFile).cache();

    long numAs = logData.filter(s -> s.contains("a")).count();
    long numBs = logData.filter(s -> s.contains("b")).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    spark.stop();
  }
}

Everything works fine. Now I want to replace the filter with a flatMap and then a map. So far I have got the flatMap:

logData.flatMap((FlatMapFunction<String, String>) l -> { return Arrays.asList(l.split(" ")).iterator(); }, Encoders.STRING());

Now I would like to map each word to a tuple

(word, 1)

i.e. a Tuple2<String, Long>, and then group the tuples by key. But the problem is that I cannot find how to map from String to Tuple2<String, Long>. Most of the documentation talks about mapToPair, but Dataset has no such method!

Can somebody help me map a String to a Tuple2<String, Long>? By the way, I am not even sure whether Tuple2 is the class I am looking for or whether I need some other class.

[UPDATE]

Based on the suggestion provided by @mangusta, I tried the following:

logData.flatMap((FlatMapFunction<String, String>) l -> { return Arrays.asList(l.split(" ")).iterator(); }, Encoders.STRING())
    .map(new Function<String, Tuple2<String, Long>>() {
        public Tuple2<String, Long> call(String str) {
            return new Tuple2<String, Long>(str, 1L);
        }
    })
    .count()

and I faced this compilation error:

Error:(108, 17) java: no suitable method found for map(<anonymous org.apache.spark.api.java.function.Function<java.lang.String,scala.Tuple2<java.lang.String,java.lang.Long>>>)
    method org.apache.spark.sql.Dataset.<U>map(scala.Function1<java.lang.String,U>,org.apache.spark.sql.Encoder<U>) is not applicable
      (cannot infer type-variable(s) U
        (actual and formal argument lists differ in length))
    method org.apache.spark.sql.Dataset.<U>map(org.apache.spark.api.java.function.MapFunction<java.lang.String,U>,org.apache.spark.sql.Encoder<U>) is not applicable
      (cannot infer type-variable(s) U
        (actual and formal argument lists differ in length))

It seems the map function accepts two arguments. I am not sure what I should pass as the second argument.

3 Answers:

Answer 0 (score: 1)

If you need to use Tuple2, you should use the Scala library for Java, i.e. scala-library.jar.

To prepare tuples from some JavaRDD<String> data, you can apply the following function to that RDD:

JavaRDD<Tuple2<String, Long>> tupleRDD = data.map(
    new Function<String, Tuple2<String, Long>>() {
        public Tuple2<String, Long> call(String str) {
            return new Tuple2<String, Long>(str, 1L);
        } // end call
    } // end Function
); // end map
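
To get from those (word, 1) tuples to actual per-word counts on the RDD API, a minimal sketch could look like the following. This is an addition rather than part of the answer: it assumes the SparkSession spark from the question, uses the usual org.apache.spark.api.java classes (JavaSparkContext, JavaRDD, JavaPairRDD) plus scala.Tuple2, and the variable names are illustrative.

// Sketch only: derive a JavaSparkContext from the question's SparkSession.
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

// Split each line into words.
JavaRDD<String> words = sc.textFile("YOUR_SPARK_HOME/README.md")
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

// mapToPair yields a JavaPairRDD, which (unlike Dataset) offers key-based operations.
JavaPairRDD<String, Long> pairs = words.mapToPair(word -> new Tuple2<>(word, 1L));
JavaPairRDD<String, Long> counts = pairs.reduceByKey(Long::sum);

// Print a few (word, count) results.
for (Tuple2<String, Long> t : counts.take(10)) {
    System.out.println(t._1() + ": " + t._2());
}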

Answer 1 (score: 1)

I am not sure about the cause of the error, but you can try the following code:

final String sparkHome = "/usr/local/Cellar/apache-spark/2.3.2";
SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("spark-example")
        .setSparkHome(sparkHome + "/libexec");

SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
Dataset<Row> df = spark.read().textFile(sparkHome + "/README.md")
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator(), Encoders.STRING())
        .filter(s -> !s.isEmpty())
        .map(word -> new Tuple2<>(word.toLowerCase(), 1L), Encoders.tuple(Encoders.STRING(), Encoders.LONG()))
        .toDF("word", "count")
        .groupBy("word")
        .sum("count").orderBy(new Column("sum(count)").desc()).withColumnRenamed("sum(count)", "_cnt");

df.show(false);

You should expect this output:

+-------------+----+
|word         |_cnt|
+-------------+----+
|the          |25  |
|to           |19  |
|spark        |16  |
|for          |15  |
|and          |10  |
|a            |9   |
|##           |9   |
|you          |8   |
|run          |7   |
|on           |7   |
|can          |7   |
|is           |6   |
|in           |6   |
|of           |5   |
|using        |5   |
|including    |4   |
|if           |4   |
|with         |4   |
|documentation|4   |
|an           |4   |
+-------------+----+
only showing top 20 rows
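
As a side note on the compilation error in the question: Dataset.map expects two arguments, the mapping function as a MapFunction (org.apache.spark.api.java.function.MapFunction, not the Function class the question used) and an Encoder for the result type. A minimal sketch of just that step, assuming words is a Dataset<String> of individual words:

// 'words' is an assumed Dataset<String>; the cast selects the Java-friendly overload
// and Encoders.tuple(...) supplies the Encoder that was missing in the question's attempt.
Dataset<Tuple2<String, Long>> pairs = words.map(
        (MapFunction<String, Tuple2<String, Long>>) w -> new Tuple2<>(w, 1L),
        Encoders.tuple(Encoders.STRING(), Encoders.LONG()));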

Answer 2 (score: 0)

Try this:

logData.flatMap((FlatMapFunction<String, String>) line ->
            Arrays.asList(line.split(" ")).iterator(), Encoders.STRING())
        .groupBy("value")
        .count()
        .show();
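
Worth noting: "value" is simply the default column name Spark gives to the single column of a Dataset<String>, so groupBy("value").count() counts the words without needing Tuple2 at all. If you also want the most frequent words first, a hedged variant (the ordering step is an addition, not part of this answer) would be:

logData.flatMap((FlatMapFunction<String, String>) line ->
            Arrays.asList(line.split(" ")).iterator(), Encoders.STRING())
        .groupBy("value")
        .count()
        .orderBy(org.apache.spark.sql.functions.col("count").desc())
        .show();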