I need to keep only those rows of a Spark dataset whose value in the column "manufacturer" is one of the elements present in an ArrayList.

The complete dataset:
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"west chester","I wish Java could use case classes",
"wells lamont","Logistic,regression,models,are,neat";
After applying the filter against the ArrayList, I need:
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"weiler", "Hi I heard about Spark",
"wells lamont","Logistic,regression,models,are,neat";
I am trying with the following code, but I cannot figure out how to go further.
try {
    System.setProperty("hadoop.home.dir", "C:\\AD_classfication\\Apachespark\\winutil");
    // The JavaSparkContext/SQLContext are created but not actually used below.
    JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);
    SparkSession spark = SparkSession.builder()
            .appName("JavaTokenizerExample")
            .getOrCreate();

    List<Row> data = Arrays.asList(
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("weiler", "Hi I heard about Spark"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("west chester", "I wish Java could use case classes"),
            RowFactory.create("wells lamont", "Logistic,regression,models,are,neat"));

    // "manufacturer" holds strings, so it must be StringType (not IntegerType).
    StructType schema = new StructType(new StructField[] {
            new StructField("manufacturer", DataTypes.StringType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });

    ArrayList<String> uniqueManufacturer = new ArrayList<String>();
    uniqueManufacturer.add("weiler");
    uniqueManufacturer.add("wells lamont");

    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    // This is where I am stuck: how do I keep only the rows whose
    // "manufacturer" value is contained in uniqueManufacturer?
    // List<Row> distinctManufacturerNamesList = sentenceDataFrame.filter("manufacturer");

    sentenceDataFrame.show();
} catch (Exception e) {
    e.printStackTrace();
}
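One way to express this filter, sketched below under the assumption of Spark 2.x with the Java API, is to pass the list to `Column.isin` on the `manufacturer` column and hand the resulting `Column` to `Dataset.filter`. The class name `ManufacturerFilterSketch` is made up for illustration; the data and schema are the same as in the question.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class ManufacturerFilterSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ManufacturerFilterSketch")
                .master("local[*]")
                .getOrCreate();

        // Same sample data and schema as in the question.
        List<Row> data = Arrays.asList(
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("weiler", "Hi I heard about Spark"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("west chester", "I wish Java could use case classes"),
                RowFactory.create("wells lamont", "Logistic,regression,models,are,neat"));
        StructType schema = new StructType(new StructField[] {
                new StructField("manufacturer", DataTypes.StringType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

        // The values to keep. Column.isin takes varargs, so the list is
        // converted with toArray() before being passed in.
        List<String> uniqueManufacturer = Arrays.asList("weiler", "wells lamont");
        Dataset<Row> filtered = sentenceDataFrame
                .filter(col("manufacturer").isin(uniqueManufacturer.toArray()));

        filtered.show();                                    // only "weiler" and "wells lamont" rows remain
        List<Row> filteredRows = filtered.collectAsList();  // materialize on the driver if a List<Row> is needed

        spark.stop();
    }
}
```

Note that `collectAsList()` pulls the filtered rows back to the driver, which is fine for a small result but should be avoided for large datasets. If the list of allowed manufacturers were very large, an alternative worth considering is turning it into a small DataFrame and doing an inner join on `manufacturer` instead of using `isin`.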