How to join CSV files with a table in Hive

Posted: 2019-04-09 09:57:33

Tags: java apache-spark apache-spark-sql

I am having a problem with a join. I have loaded data from some CSV files and want to join it with a table in Hive.

I have tried to do this following the documentation, but without success.

I define the table as

Dataset<Row> table = sparkSession.sql(query); // sparkSession: an existing SparkSession instance

and I want to join it with

Dataset<Row> data = sparkSession
    .read()
    .format("csv")
    .option("header", true)
    .option("inferSchema", true)
    .load(path1, path2);

I have tried

data.join(table, data.col("id1").equalTo(table.col("id2")), "left");

2 Answers:

Answer 0 (score: 0)

You should try joinWith:

data.joinWith(table, data.col("id1").equalTo(table.col("id2")), "left");

参考:https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html

Edit:

Use left_outer instead of left; left is not a valid joinType here, and there is absolutely no difference between LEFT JOIN and LEFT OUTER JOIN:

data.join(table, data.col("id1").equalTo(table.col("id2")), "left_outer");

ref:https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html
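
For reference, joinWith does not return a flat Dataset<Row>: it keeps each side whole and yields a Dataset of pairs. A minimal sketch of the corrected call, assuming the same data/table Datasets and id1/id2 columns as above:

import scala.Tuple2;

// Each matched pair becomes one Tuple2; with left_outer, unmatched
// left rows carry null on the right side
Dataset<Tuple2<Row, Row>> pairs =
    data.joinWith(table, data.col("id1").equalTo(table.col("id2")), "left_outer");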

join

public Dataset<Row> join(Dataset<?> right,
                scala.collection.Seq<String> usingColumns,
                String joinType)
Equi-join with another DataFrame using the given columns.
Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

Parameters:
right - Right side of the join operation.
usingColumns - Names of the columns to join on. These columns must exist on both sides.
joinType - One of: inner, outer, left_outer, right_outer, leftsemi.
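
Note that this overload implements JOIN USING, so the key must carry the same name on both sides. A hedged sketch of calling it from Java, assuming both Datasets share a hypothetical column named id:

import java.util.Arrays;
import scala.collection.JavaConverters;
import scala.collection.Seq;

// Build the Scala Seq<String> that this overload expects
Seq<String> keys = JavaConverters
    .asScalaBufferConverter(Arrays.asList("id"))
    .asScala()
    .toSeq();

// The id column appears only once in the output, as with SQL's JOIN USING
Dataset<Row> joined = data.join(table, keys, "left_outer");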

Answer 1 (score: 0)

OK, I got the answer. The problem was with the schema: when you read CSV files in Spark you need to define a schema, and the join key must be declared in that schema even if you don't want to keep the field in the output; otherwise the join won't work.
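
A minimal sketch of what that looks like, with a hypothetical payload column (value); the point is that the join key (id1 here) is declared explicitly in the CSV schema instead of relying on inferSchema:

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declare the CSV schema explicitly, including the join key
StructType schema = new StructType()
    .add("id1", DataTypes.LongType)      // join key: must be present in the schema
    .add("value", DataTypes.StringType); // hypothetical payload column

Dataset<Row> data = sparkSession
    .read()
    .format("csv")
    .option("header", true)
    .schema(schema) // explicit schema instead of inferSchema
    .load(path1, path2);

Dataset<Row> joined = data.join(table, data.col("id1").equalTo(table.col("id2")), "left_outer");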