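Given the following dataset:
| title | start | end |
| bla | 10 | 30 |
I want to find the difference between the two numbers (end - start) and put it in a new column, so that the result looks like: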
| title | time_spent |
| bla | 20 |
The data is of type Dataset<Row>.
dataset = dataset.withColumn("millis spent: ", col("end") - col("start")).as("Time spent");
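Based on what I saw in this question, I expected this to work, but it does not. Maybe that is because that thread is about DataFrames rather than Datasets, or because Scala allows it while it is illegal in Java?
Answer 0 (score: 3)
You could consider using static methods. In short: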
import static org.apache.spark.sql.functions.expr;
...
df = df
    .withColumn("time_spent", expr("end - start"))
    .drop("start")
    .drop("end");
expr() parses the string as a SQL expression and evaluates it against the values in your columns.
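As an alternative to expr(), the subtraction can also be written with the Column API. Java has no operator overloading, so `col("end") - col("start")` (valid in Scala, where `-` is a method on Column) does not compile in Java; Column.minus() is the Java-callable equivalent. Note also that Dataset.as(String) sets an alias for the whole dataset rather than renaming a column; the new column's name is the first argument of withColumn(). The sketch below assumes the same title/start/end schema as in the question; the class name MinusApp and the helper withTimeSpent() are made up for this illustration.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical class name, used only for this illustration.
public class MinusApp {

  // Adds time_spent = end - start, then drops the two source columns.
  static Dataset<Row> withTimeSpent(Dataset<Row> df) {
    return df
        .withColumn("time_spent", col("end").minus(col("start")))
        .drop("start")
        .drop("end");
  }

  public static void main(String[] args) {
    // Local session, as in the answer's full example.
    SparkSession spark = SparkSession.builder()
        .appName("Column.minus sketch")
        .master("local")
        .getOrCreate();

    // Schema assumed from the question: title, start, end.
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("title", DataTypes.StringType, false),
        DataTypes.createStructField("start", DataTypes.IntegerType, false),
        DataTypes.createStructField("end", DataTypes.IntegerType, false) });

    List<Row> rows = Arrays.asList(RowFactory.create("bla", 10, 30));
    Dataset<Row> df = spark.createDataFrame(rows, schema);

    withTimeSpent(df).show();
    spark.stop();
  }
}
```

Whether to use expr("end - start") or col("end").minus(col("start")) is mostly a matter of taste: the string form reads like SQL, while the Column form lets the compiler catch a misspelled method name.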
Here is a full example with the correct imports. Sorry, most of the example is about creating the dataframe.
package net.jgp.books.sparkInAction.ch12.lab990Others;
import static org.apache.spark.sql.functions.expr;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
 * Use of expr().
 *
 * @author jgp
 */
public class ExprApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    ExprApp app = new ExprApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("All joins!")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "title",
            DataTypes.StringType,
            false),
        DataTypes.createStructField(
            "start",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "end",
            DataTypes.IntegerType,
            false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create("bla", 10, 30));
    Dataset<Row> df = spark.createDataFrame(rows, schema);
    df.show();

    df = df
        .withColumn("time_spent", expr("end - start"))
        .drop("start")
        .drop("end");
    df.show();
  }
}