I have a Spark DataFrame that looks like this:
+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...| male| A/5 21171| null| S|
|Cumings, Mrs. Joh...|female| PC 17599| C85| C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null| S|
|Futrelle, Mrs. Ja...|female| 113803| C123| S|
|Palsson, Master. ...| male| 349909| null| S|
+--------------------+------+----------------+-----+--------+
Now I need to transform the "Name" column so that it contains only the title, i.e. Mr., Mrs., Miss. or Master. The resulting DataFrame would be:
+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr. | male| A/5 21171| null| S|
|Mrs. |female| PC 17599| C85| C|
|Miss. |female|STON/O2. 3101282| null| S|
|Mrs. |female| 113803| C123| S|
|Master. | male| 349909| null| S|
+--------------------+------+----------------+-----+--------+
I tried to apply a substring operation:
List<String> list = Arrays.asList("Mr.","Mrs.", "Mrs.","Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));
But this does not seem to be easy in Java. How can I do this in Java? Note that I am using Spark 2.2.0.
Answer 0 (score: 1)
I finally managed to solve it, so I am answering my own question. I extended Mohit's answer with a UDF:
import org.apache.spark.sql.api.java.UDF1;
import scala.Option;
import scala.Some;

// Maps a full passenger name to the title it contains, or "Untitled" if none is found.
private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) { // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) { // Or if has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) { // Or if has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if has Master.
        return Some.apply("Master.");
    } else { // Not any.
        return Some.apply("Untitled");
    }
};
Then I had to register the UDF as follows:
SparkSession spark = SparkSession.builder().master("local[*]")
        .config("spark.sql.warehouse.dir", "/home/martin/")
        .appName("Titanic")
        .getOrCreate();
Dataset<Row> df = ....
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(
        callUDF("getTitle", col("Name")).alias("Name"),
        col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();
The preceding code produces the following output:
+-----+------+----------------+-----+--------+
| Name| Sex| Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
| Mr.| male| A/5 21171| null| S|
| Mrs.|female| PC 17599| C85| C|
|Miss.|female|STON/O2. 3101282| null| S|
| Mrs.|female| 113803| C123| S|
| Mr.| male| 373450| null| S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
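The if/else chain in the UDF above can also be written as a single regular expression. The following is only a minimal sketch in plain Java (no Spark needed), so the matching logic can be checked on its own. The class name TitleExtractor, the method name extractTitle, the regex, and the null check are all assumptions made for illustration and are not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; name, regex, and null handling are illustrative assumptions.
class TitleExtractor {
    // Matches one of the four titles from the question, including the trailing dot.
    // The leading \b keeps a title from matching in the middle of a longer word.
    private static final Pattern TITLE = Pattern.compile("\\b(Mrs|Mr|Miss|Master)\\.");

    // Returns the first title found in the name, or "Untitled" if there is none.
    static String extractTitle(String name) {
        if (name == null) { // Assumption: treat a missing name as untitled.
            return "Untitled";
        }
        Matcher m = TITLE.matcher(name);
        return m.find() ? m.group() : "Untitled";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("Braund, Mr. Owen ..."));   // prints "Mr."
        System.out.println(extractTitle("Palsson, Master. ..."));   // prints "Master."
    }
}
```

A pattern like this could probably also be passed to Spark's built-in regexp_extract function (in org.apache.spark.sql.functions) instead of registering a UDF, but that variant is untested here and would not produce the "Untitled" fallback by itself.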
Answer 1 (score: 0)
I think the following code is enough to do the job.
public class SomeClass {
    ...
    /**
     * Return the title of the name.
     */
    public String getTitle(String name) {
        if (name.contains("Mr.")) { // If it has Mr.
            return "Mr.";
        } else if (name.contains("Mrs.")) { // Or if has Mrs.
            return "Mrs.";
        } else if (name.contains("Miss.")) { // Or if has Miss.
            return "Miss.";
        } else if (name.contains("Master.")) { // Or if has Master.
            return "Master.";
        } else { // Not any.
            return "Untitled";
        }
    }
}
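The method above is shown without a caller. As a rough sketch, it can be applied to each name in a list, much as a UDF is applied to each value of the "Name" column. The class name TitleDemo, the helper titlesOf, and the null guard are assumptions added for illustration. Only the if/else chain comes from the answer, and the sample names are the truncated values shown in the question's table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; only the if/else chain is taken from the answer.
class TitleDemo {
    static String getTitle(String name) {
        if (name == null) { // Assumption: name.contains(...) would throw on a null name.
            return "Untitled";
        }
        if (name.contains("Mr.")) {
            return "Mr.";
        } else if (name.contains("Mrs.")) {
            return "Mrs.";
        } else if (name.contains("Miss.")) {
            return "Miss.";
        } else if (name.contains("Master.")) {
            return "Master.";
        } else {
            return "Untitled";
        }
    }

    // Maps every name to its title, keeping the original order.
    static List<String> titlesOf(List<String> names) {
        return names.stream().map(TitleDemo::getTitle).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Braund, Mr. Owen ...", "Cumings, Mrs. Joh...", "Heikkinen, Miss. ...",
                "Futrelle, Mrs. Ja...", "Palsson, Master. ...");
        System.out.println(titlesOf(names)); // prints [Mr., Mrs., Miss., Mrs., Master.]
    }
}
```

The order of the checks does not matter for these four titles, because the string "Mrs." does not contain "Mr." as a substring.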