Search and replace in Apache Spark

Time: 2017-04-17 18:05:24

Tags: join apache-spark apache-spark-sql apache-spark-dataset apache-spark-2.0

We created two Datasets, sentenceDataFrame and sentenceDataFrame2, between which the search and replace should take place.

sentenceDataFrame2 stores the search and replace terms.

We also tried all 11 join types: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi', 'leftanti', 'cross'. None of them gave us the result we want.

Could you tell us where we are going wrong and kindly point us in the right direction?

        List<Row> data = Arrays.asList(
            RowFactory.create(0, "Allen jeevi pramod Allen"),
            RowFactory.create(1,"sandesh Armstrong jeevi"),
            RowFactory.create(2,"harsha Nischay DeWALT"));

        StructType schema = new StructType(new StructField[] {
            new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);


        List<Row> data2 = Arrays.asList(
          RowFactory.create("Allen", "Apex Tool Group"),
          RowFactory.create("Armstrong","Apex Tool Group"),
          RowFactory.create("DeWALT","StanleyBlack"));

        StructType schema2 = new StructType(new StructField[] {
            new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
            new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);

        Dataset<Row> remainingElements = sentenceDataFrame.join(
            sentenceDataFrame2,
            sentenceDataFrame.col("label").equalTo(sentenceDataFrame2.col("label2")),
            "cross");
        System.out.println("Left anti join count :" + remainingElements.count());

Input

Allen jeevi pramod Allen
sandesh Armstrong jeevi
harsha Nischay DeWALT

Expected output

Apex Tool Group jeevi pramod Apex Tool Group
sandesh Apex Tool Group jeevi
harsha Nischay StanleyBlack

3 Answers:

Answer 0 (score: 3)

For join conditions that do not involve a simple equality like this, you will need to use a Spark user-defined function (UDF).

Here is a JUnit snippet that will not compile directly, but shows the relevant imports and logic. The Java API is quite verbose, though; I leave doing this in Scala as an exercise for the reader. It would be much more concise.

The callUDF() and col() methods require a static import.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.*;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.junit.Test;

@Test
public void testSomething() {
    List<Row> data = Arrays.asList(
        RowFactory.create(0, "Allen jeevi pramod Allen"),
        RowFactory.create(1, "sandesh Armstrong jeevi"),
        RowFactory.create(2, "harsha Nischay DeWALT")
    );

    StructType schema = new StructType(new StructField[] {
        new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) 
    });
    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    List<Row> data2 = Arrays.asList(
        RowFactory.create("Allen", "Apex Tool Group"),
        RowFactory.create("Armstrong","Apex Tool Group"),
        RowFactory.create("DeWALT","StanleyBlack")
    );

    StructType schema2 = new StructType(new StructField[] {
        new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
        new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) 
    });
    Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);

    UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
        private static final long serialVersionUID = -5239951370238629896L;

        @Override
        public Boolean call(String t1, String t2) throws Exception {
            return t1.contains(t2);
        }
    };
    spark.udf().register("contains", contains, DataTypes.BooleanType);

    UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
        private static final long serialVersionUID = -2882956931420910207L;

        @Override
        public String call(String t1, String t2, String t3) throws Exception {
            return t1.replaceAll(t2, t3);
        }
    };
    spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);

    Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
                                           .withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
                                           .select(col("sentence_replaced"));

    joined.show(false);
}

Output:

+--------------------------------------------+
|sentence_replaced                           |
+--------------------------------------------+
|Apex Tool Group jeevi pramod Apex Tool Group|
|sandesh Apex Tool Group jeevi               |
|harsha Nischay StanleyBlack                 |
+--------------------------------------------+
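
As a side note, the containment check in the join condition does not strictly need a UDF: the built-in Column.contains can express it, so only the replaceWithTerm UDF is required. A minimal sketch, assuming the same DataFrames and UDF registration as above (joinedViaContains is just an illustrative name):

    Dataset<Row> joinedViaContains = sentenceDataFrame
        // Keep pairs of rows where the sentence contains the search term in label2.
        .join(sentenceDataFrame2,
              sentenceDataFrame.col("sentence").contains(sentenceDataFrame2.col("label2")))
        // Replace the matched term with its replacement from sentence2.
        .withColumn("sentence_replaced",
              callUDF("replaceWithTerm", col("sentence"), col("label2"), col("sentence2")))
        .select(col("sentence_replaced"));
    joinedViaContains.show(false);

With either join condition, a sentence that contains more than one search term matches several rows of sentenceDataFrame2 and therefore shows up more than once in the result; answer 2 below runs into exactly that.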

Answer 1 (score: 2)

We can use replaceAll inside a UDF to achieve the expected output.

public class Test {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);
        SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();

        List<Row> data = Arrays.asList(
            RowFactory.create(0, "Allen jeevi pramod Allen"),
            RowFactory.create(1, "sandesh Armstrong jeevi"),
            RowFactory.create(2, "harsha Nischay DeWALT")
        );

        StructType schema = new StructType(new StructField[] {
            new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
        UDF1<String, String> mode = new UDF1<String, String>() {
            public String call(final String types) throws Exception {
                // Apply the three fixed replacements to the incoming sentence.
                return types.replaceAll("Allen", "Apex Tool Group")
                            .replaceAll("Armstrong", "Apex Tool Group")
                            .replaceAll("DeWALT", "StanleyBlack");
            }
        };

        sqlContext.udf().register("mode", mode, DataTypes.StringType);

        sentenceDataFrame.createOrReplaceTempView("people");
        Dataset<Row> newDF = sqlContext.sql("SELECT mode(sentence), label FROM people").withColumnRenamed("UDF(sentence)", "sentence");
        newDF.show(false);
    }
}

Output

  +--------------------------------------------+------+
  |sentence                                    |label |
  +--------------------------------------------+------+
  |Apex Tool Group jeevi pramod Apex Tool Group|  0   |
  |sandesh Apex Tool Group jeevi               |  1   |
  |harsha Nischay StanleyBlack                 |  2   |
  +--------------------------------------------+------+
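
Since the search and replace terms are hard-coded in this UDF anyway, the same result can also be obtained without any UDF by chaining the built-in regexp_replace function. A minimal sketch, assuming a static import of org.apache.spark.sql.functions.* (note that regexp_replace treats the search terms as regular expressions):

    // Chain the three fixed replacements; no UDF registration needed.
    Dataset<Row> newDF = sentenceDataFrame.withColumn("sentence",
        regexp_replace(
            regexp_replace(
                regexp_replace(col("sentence"), "Allen", "Apex Tool Group"),
                "Armstrong", "Apex Tool Group"),
            "DeWALT", "StanleyBlack"));
    newDF.show(false);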

Answer 2 (score: 1)

Still facing a similar issue.

Input

Allen Armstrong jeevi pramod Allen
sandesh Armstrong jeevi
harsha nischay DeWALT

Output

Apex Tool Group Armstrong jeevi pramod Apex Tool Group
Allen Apex Tool Group jeevi pramod Allen
sandesh Apex Tool Group jeevi
harsha nischay StanleyBlack

Expected output

Apex Tool Group Apex Tool Group jeevi pramod Apex Tool Group
sandesh Apex Tool Group jeevi
harsha nischay StanleyBlack

We get this output when more than one replacement has to be applied to the same sentence.

Is there another approach we should follow to get the correct output, or is this a limitation of the UDF approach?
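
One way to avoid the duplicated rows (a sketch, not from this thread) is to skip the join entirely: collect the search/replace pairs from sentenceDataFrame2 to the driver and apply all of them inside a single UDF, so each sentence is processed exactly once no matter how many terms it contains. The sketch below assumes the pair table is small enough to collect, uses the same static imports of callUDF and col as in answer 0 plus org.apache.spark.sql.api.java.UDF1, and, like the answers above, treats the search terms as regular expressions via replaceAll:

    // Collect the (label2 = search term, sentence2 = replacement) pairs to the driver.
    List<Row> pairs = sentenceDataFrame2.collectAsList();

    // Apply every replacement to the sentence in one pass.
    UDF1<String, String> replaceAllTerms = sentence -> {
        String result = sentence;
        for (Row pair : pairs) {
            result = result.replaceAll(pair.getString(0), pair.getString(1));
        }
        return result;
    };
    spark.udf().register("replaceAllTerms", replaceAllTerms, DataTypes.StringType);

    // Each input row yields exactly one output row, however many terms it contains.
    Dataset<Row> replaced = sentenceDataFrame
        .withColumn("sentence", callUDF("replaceAllTerms", col("sentence")));
    replaced.show(false);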