How to apply string operations to a Spark DataFrame in Java

Date: 2018-03-24 12:30:55

Tags: java string apache-spark spark-dataframe

I have a Spark DataFrame that looks like this:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...|  male|       A/5 21171| null|       S|
|Cumings, Mrs. Joh...|female|        PC 17599|  C85|       C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null|       S|
|Futrelle, Mrs. Ja...|female|          113803| C123|       S|
|Palsson, Master. ...|  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

Now I need to reduce the "Name" column so that it contains only the title, i.e. Mr., Mrs., Miss., Master. The resulting DataFrame would be:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr.                 |  male|       A/5 21171| null|       S|
|Mrs.                |female|        PC 17599|  C85|       C|
|Miss.               |female|STON/O2. 3101282| null|       S|
|Mrs.                |female|          113803| C123|       S|
|Master.             |  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

I tried applying a substring operation:

List<String> list = Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));

But this does not work: isin only keeps rows whose Name value matches a list entry exactly, whereas the names merely contain the titles as substrings. How can I do this in Java? Note that I am using Spark 2.2.0.

2 answers:

Answer 0 (score: 1)

I finally managed to solve it and answer my own question. I extended Mohit's answer with a UDF:

// Requires org.apache.spark.sql.api.java.UDF1, scala.Option and scala.Some.
private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) { // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) { // Or if it has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) { // Or if it has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if it has Master.
        return Some.apply("Master.");
    } else { // None of them.
        return Some.apply("Untitled");
    }
};

Then I had to register the preceding UDF as follows:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().master("local[*]")
                    .config("spark.sql.warehouse.dir", "/home/martin/")
                    .appName("Titanic")
                    .getOrCreate();
Dataset<Row> df = ....
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(callUDF("getTitle", col("Name")).alias("Name"), col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();

The preceding code produces the following output:

+-----+------+----------------+-----+--------+
| Name|   Sex|          Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
|  Mr.|  male|       A/5 21171| null|       S|
| Mrs.|female|        PC 17599|  C85|       C|
|Miss.|female|STON/O2. 3101282| null|       S|
| Mrs.|female|          113803| C123|       S|
|  Mr.|  male|          373450| null|       S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
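
As an aside, a similar result can probably be obtained without a UDF at all, using the built-in when/otherwise column functions. The following is only a minimal, untested sketch; it assumes the same DataFrame df and static imports from org.apache.spark.sql.functions:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// Pick the first title contained in the Name value, or "Untitled" if none matches.
Dataset<Row> titlesDF = df.select(
        when(col("Name").contains("Mr."), "Mr.")
            .when(col("Name").contains("Mrs."), "Mrs.")
            .when(col("Name").contains("Miss."), "Miss.")
            .when(col("Name").contains("Master."), "Master.")
            .otherwise("Untitled")
            .alias("Name"),
        col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));

Because the checks use contains with the trailing period included, none of the four literals is a substring of another, so the order of the when branches should not change the result.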

Answer 1 (score: 0)

I think the following code should be enough to do the job.

public class SomeClass {
...

    /**
     * Return the title of the name.
     */
    public String getTitle(String name) {
        if (name.contains("Mr.")) { // If it has Mr.
            return "Mr.";
        } else if (name.contains("Mrs.")) { // Or if it has Mrs.
            return "Mrs.";
        } else if (name.contains("Miss.")) { // Or if it has Miss.
            return "Miss.";
        } else if (name.contains("Master.")) { // Or if it has Master.
            return "Master.";
        } else { // None of them.
            return "Untitled";
        }
    }
}
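
This helper only extracts the title from a single name string; it still has to be applied to the DataFrame. One possible way to wire it in is to wrap it in a UDF, as in the accepted answer. The sketch below is untested; the UDF name "getTitleHelper" and the variable names are illustrative, and it assumes org.apache.spark.sql.api.java.UDF1, DataTypes, and static imports of callUDF and col from org.apache.spark.sql.functions:

// Create the helper inside the lambda so nothing non-serializable is captured.
spark.sqlContext().udf().register("getTitleHelper",
        (UDF1<String, String>) name -> new SomeClass().getTitle(name),
        DataTypes.StringType);

Dataset<Row> titled = df.withColumn("Name", callUDF("getTitleHelper", col("Name")));
titled.show();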