My dataset looks like this:
{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":3074,"rating":2.108323259636257, "timestamp":1481673546}
{"user":333,"product":943, "rating":2.0211849667268353,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.041045323231024, "timestamp":1481673546}
{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}
{"user":333,"product":119, "rating":2.1937538029700203,"timestamp":1481673546}
{"user":111,"product":123, ...
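I want to query all records for a user (for example, 333), but return only the record with the latest timestamp for each product. Given the data above, the query would return: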
{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}
The equivalent SQL query would look something like this:
SELECT L.* FROM recommendations L
LEFT JOIN recommendations R ON
L.user = R.user AND
L.product = R.product AND
L.timestamp < R.timestamp
WHERE R.user IS NULL AND R.product IS NULL
Is this possible with a map/reduce index? If so, how? If not, is there an alternative, such as a Lucene index?
Ideally, I would also like to sort by the rating value.
Answer 0 (score: 1)
Cloudant/CouchDB MapReduce can generate aggregate counts, sums, and stats for compound keys, but it cannot return the "latest rating" for each product "grouped by user".
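A Lucene-based index doesn't help much either. It lets you select data within a time window, for example "get me the ratings between timestamp X and timestamp Y that belong to user Z", but since Lucene-based indexes have no aggregation capability, there would still be work to do in your app.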
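The work left for the app can be fairly small. As a rough sketch, assuming the rows for one user have already been fetched into a list of dicts shaped like the documents in the question, the helper below (`latest_per_product` is a name made up for this illustration, not part of any Cloudant client) keeps the newest record per product and then sorts by rating:

```python
def latest_per_product(rows, descending=True):
    """Keep the newest record per product, then sort the survivors by rating."""
    latest = {}
    for row in rows:
        current = latest.get(row["product"])
        # Replace the stored record only if this one has a later timestamp.
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["product"]] = row
    return sorted(latest.values(), key=lambda r: r["rating"], reverse=descending)


# Sample rows for user 333, copied from the question.
rows = [
    {"user": 333, "product": 943, "rating": 2.025743791177902, "timestamp": 1481675659},
    {"user": 333, "product": 3074, "rating": 2.1070657532324493, "timestamp": 1481675178},
    {"user": 333, "product": 3074, "rating": 2.108323259636257, "timestamp": 1481673546},
    {"user": 333, "product": 943, "rating": 2.0211849667268353, "timestamp": 1481675178},
    {"user": 333, "product": 943, "rating": 2.041045323231024, "timestamp": 1481673546},
    {"user": 333, "product": 119, "rating": 2.1832303461543163, "timestamp": 1481675659},
    {"user": 333, "product": 119, "rating": 2.1937538029700203, "timestamp": 1481673546},
]

result = latest_per_product(rows)
for record in result:
    print(record)
```

This is a single pass over the fetched rows plus one sort, so the cost grows with the number of records per user rather than with the size of the whole database.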
Another option is to export the data to a data warehouse such as DashDB and run the aggregating SQL query there.
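To illustrate the warehouse route, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. That substitution is an assumption made only so the example runs anywhere; DashDB's SQL dialect and loading process will differ. It loads the sample rows from the question and runs an anti-join of the kind shown there, restricted to user 333 and ordered by rating:

```python
import sqlite3

# Sample rows for user 333, copied from the question:
# (user, product, rating, timestamp)
rows = [
    (333, 943, 2.025743791177902, 1481675659),
    (333, 3074, 2.1070657532324493, 1481675178),
    (333, 3074, 2.108323259636257, 1481673546),
    (333, 943, 2.0211849667268353, 1481675178),
    (333, 943, 2.041045323231024, 1481673546),
    (333, 119, 2.1832303461543163, 1481675659),
    (333, 119, 2.1937538029700203, 1481673546),
]

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE recommendations '
    '("user" INTEGER, product INTEGER, rating REAL, "timestamp" INTEGER)'
)
conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?, ?)", rows)

# Anti-join: keep a row only when no newer row exists for the same user/product.
query = """
SELECT L."user", L.product, L.rating, L."timestamp"
FROM recommendations L
LEFT JOIN recommendations R
  ON L."user" = R."user"
 AND L.product = R.product
 AND L."timestamp" < R."timestamp"
WHERE R."user" IS NULL
  AND L."user" = ?
ORDER BY L.rating DESC
"""
latest = conn.execute(query, (333,)).fetchall()
for row in latest:
    print(row)
```

The column names `user` and `timestamp` are quoted because they are reserved words in some SQL dialects; whether that is needed depends on the target database.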