如何按键分组的时间戳返回最新记录?

时间:2016-12-14 12:20:54

标签: couchdb cloudant

我的数据集与此类似:

{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":3074,"rating":2.108323259636257, "timestamp":1481673546}
{"user":333,"product":943, "rating":2.0211849667268353,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.041045323231024, "timestamp":1481673546}
{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}
{"user":333,"product":119, "rating":2.1937538029700203,"timestamp":1481673546}
{"user":111,"product":123, ...

我想查询用户的所有记录(例如333),但只返回每个产品的最新时间戳。例如。根据上面的数据,查询将返回:

{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}     
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}

等效的sql查询看起来像这样的东西:

SELECT * FROM recommendations L
LEFT JOIN recommendations R ON
          L.user = R.user AND
          L.product = R.product AND
          L.timestamp < r.timestamp
WHERE isnull(r.user) and isnull(r.product)

这是否可以使用map / reduce索引?如果是这样,怎么样?如果没有,是否有替代方法,如lucene指数?

理想情况下,我也希望按评级值排序。

1 个答案:

答案 0 :(得分:1)

Cloudant / CouchDB MapReduce可以为复合键生成聚合计数/总和/统计数据,例如

  • 按用户和&amp;组分组的数字条目产品
  • 按用户分组的平均评分&amp;产品

但它无法返回“按用户分组”的“最新评分”产品

基于Lucene的索引也没有多大帮助。它允许允许在时间窗口中选择数据,例如“在时间戳X和属于用户Z的时间戳Y之间获得我的用户评级”,但由于基于Lucene的索引没有聚合功能,因此仍然可以在您的应用中进行工作。

另一种解决方案是将数据导出到DashDB等数据仓库解决方案,并在那里执行聚合SQL查询。