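We created two datasets, sentenceDataFrame and sentenceDataFrame2, on which the search and replace should be performed.
sentenceDataFrame2 stores the search terms and their replacements.
We also tried all 11 join types ('inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi', 'leftanti', 'cross'), and none of them gave us the expected result.
Could you tell us where we are going wrong and kindly point us in the right direction?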
List<Row> data = Arrays.asList(
RowFactory.create(0, "Allen jeevi pramod Allen"),
RowFactory.create(1,"sandesh Armstrong jeevi"),
RowFactory.create(2,"harsha Nischay DeWALT"));
StructType schema = new StructType(new StructField[] {
new StructField("label", DataTypes.IntegerType, false,
Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false,
Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<Row> data2 = Arrays.asList(
RowFactory.create("Allen", "Apex Tool Group"),
RowFactory.create("Armstrong","Apex Tool Group"),
RowFactory.create("DeWALT","StanleyBlack"));
StructType schema2 = new StructType(new StructField[] {
new StructField("label2", DataTypes.StringType, false,
Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,
Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> remainingElements=sentenceDataFrame.join(sentenceDataFrame2,sentenceDataFrame.col("label").equalTo(sentenceDataFrame2.col("label2")),"cross");
System.out.println("Left anti join count :"+remainingElements.count());
Input
Allen jeevi pramod Allen
sandesh Armstrong jeevi
harsha Nischay DeWALT
Expected output
Apex Tool Group jeevi pramod Apex Tool Group
sandesh Apex Tool Group jeevi
harsha Nischay StanleyBlack
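Answer 0 (score: 3)
For join conditions that go beyond a simple equality check like this, you will need a Spark user-defined function (UDF).
Here is a JUnit code snippet that will not compile as-is, but it shows the relevant imports and logic. The Java API is quite verbose, though. I leave doing this in Scala as an exercise for the reader; it will be much more concise.
The callUDF() and col() methods require the static import.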
import static org.apache.spark.sql.functions.*;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.junit.Test;
@Test
public void testSomething() {
List<Row> data = Arrays.asList(
RowFactory.create(0, "Allen jeevi pramod Allen"),
RowFactory.create(1, "sandesh Armstrong jeevi"),
RowFactory.create(2, "harsha Nischay DeWALT")
);
StructType schema = new StructType(new StructField[] {
new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<Row> data2 = Arrays.asList(
RowFactory.create("Allen", "Apex Tool Group"),
RowFactory.create("Armstrong","Apex Tool Group"),
RowFactory.create("DeWALT","StanleyBlack")
);
StructType schema2 = new StructType(new StructField[] {
new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
private static final long serialVersionUID = -5239951370238629896L;
@Override
public Boolean call(String t1, String t2) throws Exception {
return t1.contains(t2);
}
};
spark.udf().register("contains", contains, DataTypes.BooleanType);
UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
private static final long serialVersionUID = -2882956931420910207L;
@Override
public String call(String t1, String t2, String t3) throws Exception {
return t1.replaceAll(t2, t3);
}
};
spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
.withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
.select(col("sentence_replaced"));
joined.show(false);
}
Output:
+--------------------------------------------+
|sentence_replaced |
+--------------------------------------------+
|Apex Tool Group jeevi pramod Apex Tool Group|
|sandesh Apex Tool Group jeevi |
|harsha Nischay StanleyBlack |
+--------------------------------------------+
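One caveat about replaceAll as used in the UDF above: String.replaceAll treats its first argument as a regular expression and gives $ and \ a special meaning in the replacement, so a search term containing characters such as . or + would not be matched literally. The plain-Java sketch below (no Spark involved; the class and method names are hypothetical, chosen only for illustration) shows two ways to get a literal replacement instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class LiteralReplace {

    // String.replace substitutes every occurrence and never interprets regex syntax.
    static String replaceLiteral(String sentence, String term, String replacement) {
        return sentence.replace(term, replacement);
    }

    // Equivalent result with replaceAll, after quoting both the term and the replacement.
    static String replaceQuoted(String sentence, String term, String replacement) {
        return sentence.replaceAll(Pattern.quote(term), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Unquoted regex: '.' matches any character, so "AxB" is replaced as well.
        System.out.println("buy AxB now".replaceAll("A.B", "X"));
        // Literal variants leave "AxB" untouched.
        System.out.println(replaceLiteral("buy AxB now", "A.B", "X"));
        System.out.println(replaceQuoted("buy AxB now", "A.B", "X"));
    }
}
```

For the terms in the question (Allen, Armstrong, DeWALT) both forms behave the same, so this only matters once the term table contains regex metacharacters.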
Answer 1 (score: 2)
We can use replaceAll together with a UDF to get the expected output.
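import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public class Test {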
public static void main(String[] args) {
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
List<Row> data = Arrays.asList(
RowFactory.create(0, "Allen jeevi pramod Allen"),
RowFactory.create(1, "sandesh Armstrong jeevi"),
RowFactory.create(2, "harsha Nischay DeWALT")
);
StructType schema = new StructType(new StructField[] {
new StructField("label", DataTypes.IntegerType, false,
Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false,
Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
// Apply every hard-coded search/replace pair to the sentence in turn.
UDF1<String, String> mode = new UDF1<String, String>() {
public String call(final String types) throws Exception {
return types.replaceAll("Allen", "Apex Tool Group")
.replaceAll("Armstrong", "Apex Tool Group")
.replaceAll("DeWALT", "StanleyBlack");
}
};
sqlContext.udf().register("mode", mode, DataTypes.StringType);
sentenceDataFrame.createOrReplaceTempView("people");
// Alias the column in SQL instead of relying on the auto-generated UDF column name.
Dataset<Row> newDF = sqlContext.sql("SELECT mode(sentence) AS sentence, label FROM people");
newDF.show(false);
}
}
Output
+--------------------------------------------+------+
|sentence |label |
+--------------------------------------------+------+
|Apex Tool Group jeevi pramod Apex Tool Group| 0 |
|sandesh Apex Tool Group jeevi | 1 |
|harsha Nischay StanleyBlack | 2 |
+--------------------------------------------+------+
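Answer 2 (score: 1)
Still facing a similar problem.
Input
Allen Armstrong jeevi pramod Allen
sandesh Armstrong jeevi
Output
Apex Tool Group Armstrong jeevi pramod Apex Tool Group
Allen Apex Tool Group jeevi pramod Allen
sandesh Apex Tool Group jeevi
harsha nischay StanleyBlack
Expected output
Apex Tool Group Apex Tool Group jeevi pramod Apex Tool Group
sandesh Apex Tool Group jeevi
harsha nischay StanleyBlack
I get this output when a single row needs more than one replacement.
Is there another approach that has to be followed to get the correct output, or is this a limitation of UDFs?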
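This does not look like a limitation of UDFs as such. A join whose condition is "sentence contains term" emits one row per matching (sentence, term) pair, and each of those rows has only its own term replaced, so a sentence containing both Allen and Armstrong shows up twice. The plain-Java sketch below (no Spark; the class and method names are hypothetical, chosen only for illustration) mimics that behaviour and contrasts it with applying every term to each sentence in turn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
public class MultiTermReplace {

    // Mimics the join-based approach: one output row per (sentence, matching term)
    // pair, with only that single term replaced.
    static List<String> replacePerJoinedRow(List<String> sentences, Map<String, String> terms) {
        List<String> rows = new ArrayList<>();
        for (String sentence : sentences) {
            for (Map.Entry<String, String> term : terms.entrySet()) {
                if (sentence.contains(term.getKey())) {
                    rows.add(sentence.replace(term.getKey(), term.getValue()));
                }
            }
        }
        return rows;
    }

    // Applies all terms to each sentence one after another: one output row per sentence.
    static List<String> replaceAllTerms(List<String> sentences, Map<String, String> terms) {
        List<String> rows = new ArrayList<>();
        for (String sentence : sentences) {
            String result = sentence;
            for (Map.Entry<String, String> term : terms.entrySet()) {
                result = result.replace(term.getKey(), term.getValue());
            }
            rows.add(result);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> terms = new LinkedHashMap<>();
        terms.put("Allen", "Apex Tool Group");
        terms.put("Armstrong", "Apex Tool Group");
        terms.put("DeWALT", "StanleyBlack");
        List<String> sentences = Arrays.asList(
                "Allen Armstrong jeevi pramod Allen",
                "sandesh Armstrong jeevi",
                "harsha nischay DeWALT");

        replacePerJoinedRow(sentences, terms).forEach(System.out::println); // 4 rows
        replaceAllTerms(sentences, terms).forEach(System.out::println);     // 3 rows
    }
}
```

In Spark, one way to get the second behaviour, assuming the term table is small enough to collect to the driver, would be to load the terms into a map and apply them inside a single UDF on the sentence column, much like the hard-coded chain in Answer 1. Note that sequential replacement depends on the order of the terms if a replacement text itself contains another search term.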