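I have an RDD of product ratings built from MLlib Rating objects, each of which is just a tuple of (int userId, int productId, double rating). I want to remove every element of the RDD that is a review of a product with too few ratings.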
For example, the RDD might look like this:
Rating(35, 1, 5.0)
Rating(18, 1, 4.0)
Rating(29, 2, 3.0)
Rating(12, 2, 2.0)
Rating(65, 3, 1.0)
If I filtered out products with fewer than 2 reviews, only the last rating would be dropped and the first four would be returned. (In practice I want a minimum review count higher than 2; this is only for the example.)
At the moment I have code that outputs a list of product IDs ordered by their number of ratings, but I am not sure whether I can filter the main RDD based on that, and it looks quite inefficient in any case.
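Answer 0 (score: 1)
You can group the RDD by ProductId, then filter on whether the size of each group exceeds the threshold (1 here). Use flatMap to extract the ratings back out of the grouped RDD: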
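// Local case class mirroring the question's data; MLlib's own Rating names its fields user, product, rating
case class Rating(UserId: Int, ProductId: Int, Rating: Double)
val ratings = sc.parallelize(Seq(Rating(35, 1, 5.0),
                                 Rating(18, 1, 4.0),
                                 Rating(29, 2, 3.0),
                                 Rating(12, 2, 2.0),
                                 Rating(65, 3, 1.0)))
// Group by product, keep only groups with more than one rating, then flatten back to ratings
val prodMinCounts = ratings.groupBy(_.ProductId).
                            filter(_._2.size > 1).
                            flatMap(_._2)
prodMinCounts.collect
// res14: Array[Rating] = Array(Rating(35,1,5.0), Rating(18,1,4.0), Rating(29,2,3.0), Rating(12,2,2.0))
// (the order of elements after a groupBy is not guaranteed)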
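Since the question raises efficiency, here is a rough sketch of an alternative that counts ratings per product with reduceByKey and joins the surviving product IDs back onto the ratings, so that no group is ever materialized in full. It assumes MLlib's own Rating class (fields user, product, rating) and a local SparkContext; the names FilterByRatingCount, filterByMinCount and minRatings are made up for illustration and do not come from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

object FilterByRatingCount {
  // Hypothetical helper: keep only ratings whose product has at least minRatings ratings.
  def filterByMinCount(ratings: RDD[Rating], minRatings: Int): RDD[Rating] = {
    // Count ratings per product; reduceByKey combines counts on each partition before shuffling.
    val counts = ratings.map(r => (r.product, 1L)).reduceByKey(_ + _)
    // Keep only the products that reach the threshold.
    val popular = counts.filter { case (_, n) => n >= minRatings }
    // Inner join drops every rating whose product is not in `popular`.
    ratings.keyBy(_.product).join(popular).map { case (_, (rating, _)) => rating }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filter-by-rating-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val ratings = sc.parallelize(Seq(
      Rating(35, 1, 5.0),
      Rating(18, 1, 4.0),
      Rating(29, 2, 3.0),
      Rating(12, 2, 2.0),
      Rating(65, 3, 1.0)))
    // With minRatings = 2, the single rating for product 3 is dropped.
    filterByMinCount(ratings, 2).collect().foreach(println)
    sc.stop()
  }
}
```

If the set of surviving product IDs is small enough to fit in memory, collecting it to the driver and filtering the ratings against a broadcast Set would avoid the join; whether that is faster depends on the data, so it is worth measuring rather than assuming.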