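I am using the isin function in Spark Java and passing it a list of ids. I need to retrieve the ids in the order of the list I pass in. However, after applying isin the order changes: the result keeps the order of the dataset.
How can I preserve the order of the list?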
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("4333-444","3M INdustries"),
RowFactory.create("4777-444","3M INdustries"),
RowFactory.create("4444-888","3M INdustries"),
RowFactory.create("4999-444","3M INdustries"),
RowFactory.create("5666-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("5666-55");
listStrings.add("630-0746");
listStrings.add("4777-444");
listStrings.add("4444-444");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
Current output:
+--------+--------------------+
|  label1|           sentence1|
+--------+--------------------+
|630-0746|         Dixon value|
|4444-444|       3M INdustries|
|4777-444|       3M INdustries|
| 5666-55|Dixon coupling valve|
+--------+--------------------+
Expected output:
+--------+--------------------+
|  label1|           sentence1|
+--------+--------------------+
| 5666-55|Dixon coupling valve|
|630-0746|         Dixon value|
|4777-444|       3M INdustries|
|4444-444|       3M INdustries|
+--------+--------------------+
Answer 0 (score: 0)
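I suggest you create a dataframe instead of a list and join it with sentenceDataFrame; that way you can also preserve the order. This is more efficient than creating a list and filtering.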
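To make the join idea concrete, here is a minimal plain-Java sketch of the logic, without Spark. It pairs each requested id with its position in the list, keeps only the matching rows, and sorts by that position. Everything in it is an assumption for illustration: the class name OrderByListPosition and the method filterInListOrder are hypothetical, and rows are modeled as String[] rather than Spark Row objects. In Spark, the lookup table would be a second dataframe with an id column and a position column, joined on label1. A join by itself does not guarantee row order, so an explicit orderBy on the position column would probably still be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrderByListPosition {

    // Keeps only rows whose first field (the id) appears in `ids`,
    // ordered by the position of that id in `ids`.
    static List<String[]> filterInListOrder(List<String[]> rows, List<String> ids) {
        // Lookup table: id -> position in the requested list (first occurrence wins).
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < ids.size(); i++) {
            position.putIfAbsent(ids.get(i), i);
        }

        // Equivalent of an inner join: keep rows whose id is in the lookup table.
        List<String[]> matched = new ArrayList<>();
        for (String[] row : rows) {
            if (position.containsKey(row[0])) {
                matched.add(row);
            }
        }

        // Equivalent of ordering by the position column: sort explicitly
        // instead of relying on the order of the input rows.
        matched.sort(Comparator.comparingInt((String[] row) -> position.get(row[0])));
        return matched;
    }

    public static void main(String[] args) {
        // Same sample data as in the question.
        List<String[]> rows = Arrays.asList(
            new String[] {"405-048011-62815", "CRC Industries"},
            new String[] {"630-0746", "Dixon value"},
            new String[] {"4444-444", "3M INdustries"},
            new String[] {"4333-444", "3M INdustries"},
            new String[] {"4777-444", "3M INdustries"},
            new String[] {"4444-888", "3M INdustries"},
            new String[] {"4999-444", "3M INdustries"},
            new String[] {"5666-55", "Dixon coupling valve"});
        List<String> ids = Arrays.asList("5666-55", "630-0746", "4777-444", "4444-444");

        for (String[] row : filterInListOrder(rows, ids)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

With the sample data from the question, the rows come out as 5666-55, 630-0746, 4777-444, 4444-444, which matches the expected output. This in-memory version only makes sense for results small enough to collect on the driver; for larger data the same two steps (join on a position lookup, then sort by position) would stay inside Spark.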