Question

I have two data frames. The first is books1, with this schema:

root
|-- asin: string (nullable = true)
|-- helpful: array (nullable = true)
|    |-- element: long (containsNull = true)
|-- overall: double (nullable = true)
|-- reviewText: string (nullable = true)
|-- reviewTime: string (nullable = true)
|-- reviewerID: string (nullable = true)
|-- reviewerName: string (nullable = true)
|-- summary: string (nullable = true)
|-- unixReviewTime: long (nullable = true)

The other is label, with this schema:

root
 |-- value: integer (nullable = false)

books1 and label contain:

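But when I combine them with the join command,

var bookdf = books1.join(label)

the output is incorrect.

The value field should contain 2, 6, 0, but it contains only 2. Why does this happen? The number of rows in both data frames is the same.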

Answer 1

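You cannot pair up the rows of two data frames by calling join without a join expression. With no expression there is nothing to match rows on, so Spark does not line them up by position.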
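To make the point above concrete, here is a minimal sketch of what a join with no condition amounts to when Spark evaluates it as a Cartesian product. The toy data, the object name CrossJoinDemo, and the local SparkSession are assumptions for illustration only; they are not the asker's data. It uses crossJoin explicitly, since some Spark versions reject an implicit Cartesian product unless it is enabled in the configuration.

```scala
import org.apache.spark.sql.SparkSession

object CrossJoinDemo {
  def main(args: Array[String]): Unit = {
    // Local session, assumed for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cross-join-demo")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for books1 and label (hypothetical values).
    val left = Seq("a", "b", "c").toDF("asin")
    val right = Seq(2, 6, 0).toDF("value")

    // Every row on the left is paired with every row on the right,
    // so 3 rows x 3 rows yields 9 rows, not 3.
    val crossed = left.crossJoin(right)
    println(crossed.count())

    spark.stop()
  }
}
```

Because every book is paired with every label, the first rows printed by show() can easily all carry the same value, which would look like the behavior described in the question.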

If both data frames have the same number of rows, you can add a new column, index, to each one that holds the row number:

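import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val newBookDF = spark.sqlContext.createDataFrame(
  // Append each row's position as an extra field
  books1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Extend the schema with the index column
  StructType(books1.schema.fields :+ StructField("index", LongType, false))
)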

Do the same for the label data frame:

val newLabelDF = spark.sqlContext.createDataFrame(
  label.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Create schema for index column
  StructType(label.schema.fields :+ StructField("index", LongType, false))
)

Now you can join the two new data frames on that column:

newBookDF.join(newLabelDF, Seq("index")).drop("index")

This should give you the expected result.
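The two nearly identical blocks above can be folded into one helper. The following is a sketch under stated assumptions, not the answer's original code: withIndex, ZipJoinDemo, and the toy data are hypothetical names introduced here. It also sorts by index before dropping the column, because a join does not guarantee that rows come back in their original order.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipJoinDemo {
  // Hypothetical helper: appends a 0-based "index" column with each row's position.
  def withIndex(df: DataFrame): DataFrame = {
    val rows = df.rdd.zipWithIndex.map {
      case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(
      df.schema.fields :+ StructField("index", LongType, nullable = false)
    )
    df.sparkSession.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session and toy data, assumed for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zip-join-demo")
      .getOrCreate()
    import spark.implicits._

    val books = Seq("b1", "b2", "b3").toDF("asin")
    val labels = Seq(2, 6, 0).toDF("value")

    // Join on the shared position, restore the original order, then drop the helper column.
    val joined = withIndex(books)
      .join(withIndex(labels), Seq("index"))
      .orderBy("index")
      .drop("index")

    joined.show()
    spark.stop()
  }
}
```

This only pairs rows meaningfully if both data frames are in a stable, known order to begin with; zipWithIndex numbers rows in whatever order the underlying RDD yields them.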

