Question

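I am using Spark 1.6.1 and need to fetch the distinct values of a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger. I understand that calling distinct.collect() brings the result back to the driver program. Currently I am performing the task as shown below; is there a better approach?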

 import org.apache.spark.storage.StorageLevel
 import sqlContext.implicits._
 preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

 preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
   val applicationId = x.getAs[String](ApplicationId)
   val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
   // DO SOME TASK PER applicationId
 })

 preProcessedData.unpersist()

Answer 1

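To get all the distinct values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that you can create a UDF to transform each record.

For example: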

import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the distinct values. Calling show on this DataFrame should print only 1 and 3.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. This one is a simple function (integer division by 10), but UDFs can get complicated.
// The parameter type must be given explicitly, otherwise the lambda does not compile.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))

Answer 2

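This solution demonstrates how to transform data with Spark native functions, which are preferable to UDFs. It also demonstrates how dropDuplicates is more suitable than distinct for certain queries.

Suppose you have this DataFrame: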

+-------+-------------+
|country|    continent|
+-------+-------------+
|  china|         asia|
| brazil|south america|
| france|       europe|
|  china|         asia|
+-------+-------------+

Here is how to get all the distinct countries and run a transformation on them:

df
  .select("country")
  .distinct
  .withColumn("country", concat(col("country"), lit(" is fun!")))
  .show()

+--------------+
|       country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+

You can use dropDuplicates instead of distinct if you don't want to lose the continent information:

df
  .dropDuplicates("country")
  .withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
  .show(false)

+-------+-------------+------------------------------------+
|country|continent    |description                         |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe       |france is a country in europe       |
|china  |asia         |china is a country in asia          |
+-------+-------------+------------------------------------+

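See here for more information about filtering DataFrames and here for more information on dropping duplicates.

Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.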
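As a rough illustration of that last point, here is a minimal sketch of chaining custom transformations with transform. It assumes Spark 2.x or later and a local session; the function names dedupeByCountry and withDescription are hypothetical, and the sample data simply mirrors the table above.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit}

// Assumption: Spark 2.x+, run locally. In spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("transform-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("china", "asia"),
  ("brazil", "south america"),
  ("france", "europe"),
  ("china", "asia")
).toDF("country", "continent")

// Hypothetical custom transformations: each one is a plain DataFrame => DataFrame function.
def dedupeByCountry(input: DataFrame): DataFrame =
  input.dropDuplicates(Seq("country"))

def withDescription(input: DataFrame): DataFrame =
  input.withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))

// transform lets the steps read as one pipeline instead of nested function calls.
val result = df
  .transform(dedupeByCountry)
  .transform(withDescription)

result.show(false)

Keeping each step as a small named function makes the steps easy to test in isolation and to reorder; nothing here is collected to the driver until an action such as show is called.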

Answer 3

Try this in PySpark:

df.select('col_name').distinct().show()

Answer 4

df = df.select("column1", "column2", ..., "column N").distinct.[].collect()

In place of the empty brackets you can insert a call such as [to_JSON()] if you want df in JSON format.
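For the JSON case, a minimal sketch in Scala, assuming Spark 2.x or later, where the DataFrame method is spelled toJSON and yields one JSON string per row. The sample data and column names are made up for illustration.

import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+, run locally. In spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("distinct-json-sketch").getOrCreate()
import spark.implicits._

// Illustrative data only; the duplicate ("china", "asia") row is there to show distinct at work.
val df = Seq(("china", "asia"), ("france", "europe"), ("china", "asia")).toDF("country", "continent")

// distinct removes duplicate rows; toJSON turns each remaining row into a JSON string.
// collect() still pulls every row to the driver, so this only suits small results.
val jsonRows: Array[String] = df.select("country", "continent").distinct.toJSON.collect()
jsonRows.foreach(println)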

Fetching distinct values on a column using Spark DataFrame
