Question

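Hoping someone can help. Fairly sure this is something I'm doing wrong.

I have a DataFrame called uuidvar with one column named 'uuid', and another DataFrame, df1, with several columns, one of which is also 'uuid'. I would like to select all the rows from df1 whose uuid appears in uuidvar. Having the same column name in both is not ideal, so I tried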

val uuidselection=df1.join(uuidvar, df1("uuid") === uuidvar("uuid").as("another_uuid"), "right_outer").select("*")

However, when I show uuidselection I have 2 columns called "uuid". Furthermore, if I try to select the specific columns I want, I am told

cannot resolve 'uuidvar' given input columns

or similar, depending on what I try to select.

I tried to make it simpler and just do

val uuidvar2=uuidvar.select("uuid").as("uuidvar")

and this does not rename the column in uuidvar.

Does 'as' not operate the way I expect, am I making some other fundamental error, or is it broken?

I'm using Spark 1.5.1 and Scala 2.10.

Answer 1

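I've always used the withColumnRenamed API to rename columns:

Take this table as an example:

| Name | Age |

df.withColumnRenamed("Age", "newAge").show()

| Name | newAge |

So, to make it work with your code, something like this should work: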

val uuidvar_another = uuidvar.withColumnRenamed("uuid", "another_uuid")
val uuidselection = df1.join(uuidvar_another, df1("uuid") === uuidvar_another("another_uuid"), "right_outer").select("*")
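
On the question of why `uuidvar.select("uuid").as("uuidvar")` leaves the column name alone: `as` called on a DataFrame sets an alias for the DataFrame as a whole, not for its columns. The following is a minimal sketch of that difference, assuming the Spark 1.5-style `SQLContext` API used in the question; the object name `RenameCheck`, the helper `columnNames`, the local master URL, and the sample rows are illustrative and not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object RenameCheck {
  // Returns the column names after DataFrame.as and after withColumnRenamed.
  def columnNames(sqlContext: SQLContext): (Seq[String], Seq[String]) = {
    import sqlContext.implicits._

    // Illustrative sample data (assumption, not from the question).
    val uuidvar = Seq(Tuple1("cafe-babe-001"), Tuple1("cafe-babe-002")).toDF("uuid")

    // DataFrame.as aliases the DataFrame itself; the column is still "uuid".
    val aliased = uuidvar.select("uuid").as("uuidvar")

    // withColumnRenamed returns a new DataFrame with the column renamed.
    val renamed = uuidvar.withColumnRenamed("uuid", "another_uuid")

    (aliased.columns.toSeq, renamed.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local context for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rename-check")
    val sc = new SparkContext(conf)
    try {
      val (aliasedCols, renamedCols) = columnNames(new SQLContext(sc))
      println(aliasedCols.mkString(","))  // uuid
      println(renamedCols.mkString(","))  // another_uuid
    } finally {
      sc.stop()
    }
  }
}
```

The DataFrame alias is still useful for qualifying columns in SQL-style expressions, but it does not change what `columns` or `show()` report.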

Answer 2

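First, you cannot use as to rename a column while specifying the join condition; rename the column with withColumnRenamed before the join instead. Second, use the generic col function to access columns by name (instead of the DataFrame's apply method, e.g. df1(<columnname>)).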

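import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

// SparkFunSuite and configuredUnitTest are not defined in this snippet;
// presumably they belong to a test harness that supplies the SparkContext sc.
case class UUID1(uuid: String)
case class UUID2(uuid: String, b: Int)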

class UnsortedTestSuite2 extends SparkFunSuite {
  configuredUnitTest("SO - uuid") { sc =>
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val uuidvar = sc.parallelize( Seq(
      UUID1("cafe-babe-001"),
      UUID1("cafe-babe-002"),
      UUID1("cafe-babe-003"),
      UUID1("cafe-babe-004")
    )).toDF()

    val df1 = sc.parallelize( Seq(
      UUID2("cafe-babe-001", 1),
      UUID2("cafe-babe-002", 2),
      UUID2("cafe-babe-003", 3)
    )).toDF()


    // Rename the right-hand column before the join so both names stay unambiguous.
    val uuidselection = df1.join(
      uuidvar.withColumnRenamed("uuid", "another_uuid"),
      col("uuid") === col("another_uuid"),
      "right_outer")

    uuidselection.show()
  }
}

This yields:

+-------------+----+-------------+
|         uuid|   b| another_uuid|
+-------------+----+-------------+
|cafe-babe-001|   1|cafe-babe-001|
|cafe-babe-002|   2|cafe-babe-002|
|cafe-babe-003|   3|cafe-babe-003|
|         null|null|cafe-babe-004|
+-------------+----+-------------+

Note

.select("*") has no effect, so

df.select("*")  =^=  df
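
If the goal is only the one stated in the question (keep the rows of df1 whose uuid appears in uuidvar), a left semi join may be a simpler alternative to either answer above: it filters df1 and returns only df1's columns, so no second "uuid" column appears and no rename is needed. This is a sketch of that alternative and not the method from the answers; the object name `SemiJoinSketch`, the helper `selectByUuid`, the local master URL, and the sample rows are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object SemiJoinSketch {
  // Keeps the rows of df1 whose uuid also occurs in uuidvar.
  // A "leftsemi" join returns only the left side's columns.
  def selectByUuid(df1: DataFrame, uuidvar: DataFrame): DataFrame =
    df1.join(uuidvar, df1("uuid") === uuidvar("uuid"), "leftsemi")

  def main(args: Array[String]): Unit = {
    // Local context for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("semi-join-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    try {
      // Illustrative sample data (assumption).
      val uuidvar = Seq(
        Tuple1("cafe-babe-001"),
        Tuple1("cafe-babe-002"),
        Tuple1("cafe-babe-004")
      ).toDF("uuid")

      val df1 = Seq(
        ("cafe-babe-001", 1),
        ("cafe-babe-002", 2),
        ("cafe-babe-003", 3)
      ).toDF("uuid", "b")

      // Only the rows for cafe-babe-001 and cafe-babe-002 remain, with columns uuid and b.
      selectByUuid(df1, uuidvar).show()
    } finally {
      sc.stop()
    }
  }
}
```

Unlike the right outer join shown earlier, this drops uuids that exist only in uuidvar instead of producing a row of nulls for them.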

Scala Spark select not working as expected
