Spark DataFrame: selecting distinct rows

Time: 2019-03-05 19:47:09

Tags: java apache-spark apache-spark-sql

I tried two approaches to find distinct rows from a Parquet file, but neither seems to work.
Attempt 1: Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();
but it throws

Cannot have map type columns in DataFrame which calls set operations
(intersect, except, etc.), 
but the type of column canvasHashes is map<string,string>;;

Attempt 2: running a SQL query:

Dataset<Row> df = sqlContext.read().parquet("location.parquet");
df.createOrReplaceTempView("df");
Dataset<Row> landingDF = sqlContext.sql("SELECT distinct on timestamp * from df");

The error I get:

== SQL ==
SELECT distinct on timestamp * from df
-----------------------------^^^

Is there a way to get distinct records when reading the Parquet file? Any read option I can use?

3 answers:

Answer 0 (score: 3)

The problem you face is explicitly stated in the exception message: because MapType columns are neither hashable nor orderable, they cannot be used as part of a grouping or partitioning expression.

Logically, your take on the SQL solution is not equivalent to distinct on a Dataset. If you want to deduplicate data based on a set of compatible columns, you should use dropDuplicates:

df.dropDuplicates("timestamp")

which is equivalent to

SELECT timestamp, first(c1) AS c1, first(c2) AS c2,  ..., first(cn) AS cn,
       first(canvasHashes) AS canvasHashes
FROM df GROUP BY timestamp

Unfortunately, if your goal is an actual DISTINCT, it won't be that easy. One possible solution is to leverage Scala* Map hashing. You could define a Scala udf like this:

spark.udf.register("scalaHash", (x: Map[String, String]) => x.##)

and then use it in your Java code to derive a column that can be used with dropDuplicates:

df
  .selectExpr("*", "scalaHash(canvasHashes) AS hash_of_canvas_hashes")
  .dropDuplicates(
    // All columns excluding canvasHashes / hash_of_canvas_hashes
    "timestamp", "c1", "c2", ..., "cn",
    // Hash used as surrogate of canvasHashes
    "hash_of_canvas_hashes"
  )

which is equivalent to the SQL

SELECT 
  timestamp, c1, c2, ..., cn,   -- All columns excluding canvasHashes
  first(canvasHashes) AS canvasHashes
FROM df GROUP BY
  timestamp, c1, c2, ..., cn    -- All columns excluding canvasHashes

* Note that java.util.Map with its hashCode won't work, because hashCode is not consistent.
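The surrogate-hash idea above does not depend on Spark. Below is a minimal plain-Java sketch (the Row record and distinctRows helper are hypothetical illustrations, not Spark API) of deduplicating rows whose map column cannot serve directly as a grouping key, using the map's hash as a surrogate key. Note this runs within a single JVM, where the hashCode contract for equal maps holds:

```java
import java.util.*;

public class DedupDemo {
    // Hypothetical row: a timestamp plus a map column (the Spark analogue
    // of canvasHashes, which cannot be used as a grouping key directly)
    record Row(long timestamp, Map<String, String> canvasHashes) {}

    static List<Row> distinctRows(List<Row> rows) {
        Set<List<Object>> seen = new HashSet<>();
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            // Surrogate key: the scalar column(s) plus the map's hash
            List<Object> key = List.of(r.timestamp(), r.canvasHashes().hashCode());
            if (seen.add(key)) out.add(r);   // keep first occurrence only
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row(1L, Map.of("a", "1")),
            new Row(1L, Map.of("a", "1")),   // duplicate of the first row
            new Row(1L, Map.of("a", "2")));
        System.out.println(distinctRows(rows).size()); // prints 2
    }
}
```

The same pattern is what the selectExpr/dropDuplicates chain above expresses in Spark terms: compute a hash column once, then deduplicate on it alongside the remaining columns.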

Answer 1 (score: 2)

Yes, the syntax is incorrect. It should be:

Dataset<Row> landingDF = sqlContext.sql("SELECT distinct * from df");

Answer 2 (score: -1)

1) If you want distinct rows based on a particular column, you can select that column and call distinct on the result:

df.select("colName").distinct()
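Outside Spark, the same "project one column, then deduplicate" idea can be sketched in plain Java streams (the row layout and helper below are hypothetical, for illustration only):

```java
import java.util.*;
import java.util.stream.*;

public class DistinctColumnDemo {
    // Projects one "column" out of the rows and deduplicates it,
    // mirroring what df.select("colName").distinct() does in Spark
    static List<String> distinctColumn(List<String[]> rows, int col) {
        return rows.stream()
                   .map(r -> r[col])    // project the single column
                   .distinct()          // then deduplicate
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"1", "A"},
            new String[]{"2", "B"},
            new String[]{"3", "A"});
        System.out.println(distinctColumn(rows, 1)); // prints [A, B]
    }
}
```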

2) If you want uniqueness across all columns, use dropDuplicates:

df.dropDuplicates()
