Multiple filters on a Dataset in Apache Spark

Posted: 2017-11-13 13:58:23

Tags: java apache-spark filter rdd apache-spark-mllib

I need to keep only those rows of a Spark Dataset whose "manufacturer" column value is present in an ArrayList.

Full dataset:

"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"wells lamont","Logistic,regression,models,are,neat";

Required output after filtering against the ArrayList:

"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"wells lamont","Logistic,regression,models,are,neat";

I am trying the following code, but I cannot figure out how to go further.

try {
    System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
    JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SparkSession spark = SparkSession.builder()
                                     .appName("JavaTokenizerExample")
                                     .getOrCreate();

    List<Row> data = Arrays.asList(
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("weiler", "Hi I heard about Spark"),
                          RowFactory.create("west chester", "I wish Java could use case classes"),
                          RowFactory.create("west chester", "I wish Java could use case classes"),
                          RowFactory.create("west chester", "I wish Java could use case classes"),
                          RowFactory.create("wells lamont", "Logistic,regression,models,are,neat")
                    );

    // Both columns hold strings, so "manufacturer" must be StringType
    // (IntegerType here would make createDataFrame fail at runtime).
    StructType schema = new StructType(new StructField[] {
                            new StructField("manufacturer", DataTypes.StringType, false,
                                      Metadata.empty()),
                            new StructField("sentence", DataTypes.StringType, false,
                                      Metadata.empty())
                        });

    ArrayList<String> uniqueManufacturer = new ArrayList<String>();
    uniqueManufacturer.add("weiler");
    uniqueManufacturer.add("wells lamont");

    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
    // Stuck here: filter("manufacturer") is not a boolean condition, and
    // filter(...) returns a Dataset<Row>, not a List<Row>.
    // List<Row> distinctManufacturerNamesList = sentenceDataFrame.filter("manufacturer");
    sentenceDataFrame.show();
} catch (Exception e) {
    e.printStackTrace();
}
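For reference, one way to express this filter in Spark (an untested sketch, assuming the Spark 2.x `org.apache.spark.sql.functions.col` helper and the `Column.isin(Object...)` method) would be `sentenceDataFrame.filter(col("manufacturer").isin(uniqueManufacturer.toArray()))`. The membership predicate itself is just a per-row `contains` check against the allow-list, which can be shown self-contained on plain Java collections:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ManufacturerFilter {

    // Keeps only rows whose first field (the manufacturer) appears in the
    // allow-list. This is the same membership test that a Spark isin(...)
    // condition evaluates for each row of the Dataset.
    public static List<String[]> filterByManufacturer(List<String[]> rows,
                                                      List<String> allowed) {
        return rows.stream()
                   .filter(row -> allowed.contains(row[0]))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"weiler", "Hi I heard about Spark"},
            new String[] {"west chester", "I wish Java could use case classes"},
            new String[] {"wells lamont", "Logistic,regression,models,are,neat"});

        List<String> uniqueManufacturer = new ArrayList<>(
            Arrays.asList("weiler", "wells lamont"));

        // Prints only the "weiler" and "wells lamont" rows.
        for (String[] row : filterByManufacturer(data, uniqueManufacturer)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Note that a Spark filter returns a new `Dataset<Row>`; to obtain a `List<Row>` as in the question's code, one would still need to call `collectAsList()` on the filtered Dataset.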

0 Answers:

No answers yet.