Filtering on Oid with the spark-mongo connector

Date: 2017-08-31 14:54:24

Tags: mongodb scala apache-spark

I want to filter MongoDB documents on their ObjectId from a Spark program. I have tried the following:

case class _id(oid: String)

val str_start: _id = _id((start.getMillis() / 1000).toHexString + "0000000000000000")
val str_end: _id = _id((end.getMillis() / 1000).toHexString + "0000000000000000")

val filteredDF = df.filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"_id").between(str_start, str_end)
)

val str_start = (start.getMillis() / 1000).toHexString + "0000000000000000"
val str_end = (end.getMillis() / 1000).toHexString + "0000000000000000"

val filteredDF = df.filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"_id.oid").between(str_start, str_end)
)

Both give me an analysis error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
'(((`_timestamp` IS NOT NULL) AND ((`_timestamp` >= TIMESTAMP('2017-07-31 00:22:00.0'))
AND (`_timestamp` <= TIMESTAMP('2017-08-01 00:22:00.0')))) AND `_id`)' due to data type
mismatch: differing types in '(((`_timestamp` IS NOT NULL) AND ((`_timestamp` >=
TIMESTAMP('2017-07-31 00:22:00.0')) AND (`_timestamp` <= TIMESTAMP('2017-08-01
00:22:00.0')))) AND `_id`)' (boolean and struct<oid:string>).;;
'Filter (((((isnotnull(_timestamp#40) && ((_timestamp#40 >= 1501449720000000) &&
(_timestamp#40 <= 1501536120000000))) && _id#38) >= 597e4df80000000000000000) &&
(((isnotnull(_timestamp#40) && ((_timestamp#40 >= 1501449720000000) &&
(_timestamp#40 <= 1501536120000000))) && _id#38) <= 597f9f780000000000000000))

How can I query on the oid?
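For reference, the boundary strings in both attempts encode the ObjectId creation-time prefix: the first 8 hex characters of an ObjectId are its creation time in seconds since the epoch, so padding the hex seconds with zeros yields a range boundary. A standalone sketch of that construction (the epoch values are taken from the error message above):

```scala
// Build a 24-character ObjectId range boundary from epoch milliseconds:
// hex seconds (8 chars) followed by 16 zero characters.
def boundaryOid(epochMillis: Long): String =
  (epochMillis / 1000).toHexString + "0000000000000000"

// Timestamps from the error message (2017-07-31 to 2017-08-01 window):
println(boundaryOid(1501449720000L)) // 597e4df80000000000000000
println(boundaryOid(1501536120000L)) // 597f9f780000000000000000
```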

Thanks, Neil

1 answer:

Answer 0 (score: 0)

I think you have misplaced a parenthesis: it should be something like

and($"_id.oid" between(str_start, str_end) )

(That is why you get the error message

(boolean and struct<oid:string>)

the misplaced parenthesis makes the `AND` combine a boolean with the `_id` struct.)
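Once the parenthesis is fixed, the string-based filter works because fixed-width lowercase hex ObjectId strings compare lexicographically in the same order as their creation timestamps, which is exactly what `between` on a string column relies on. A minimal Spark-free sketch, using the question's two boundary timestamps (the middle id is an invented example):

```scala
// Boundary ids built as in the question, from its two epoch timestamps.
val str_start = (1501449720000L / 1000).toHexString + "0000000000000000" // 597e4df8...
val str_end   = (1501536120000L / 1000).toHexString + "0000000000000000" // 597f9f78...

// A hypothetical ObjectId created between the two boundaries.
val sample = "597ec6aa0000000000000000"

// String comparison orders fixed-width hex the same way as the numbers,
// so a lexicographic `between` selects the intended time window.
println(str_start <= sample && sample <= str_end) // true
```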