Filtering an RDD by number of occurrences

Time: 2017-02-14 16:03:02

Tags: scala apache-spark rdd apache-spark-mllib

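I have an RDD of product ratings built from MLlib Rating objects, each of which is just a tuple of (int userId, int productId, double rating). I want to remove every element of the RDD that is a rating for a product with too few ratings.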

For example, the RDD might look like this:

Rating(35, 1, 5.0)
Rating(18, 1, 4.0)
Rating(29, 2, 3.0)
Rating(12, 2, 2.0)
Rating(65, 3, 1.0)

If I filtered out products with fewer than 2 ratings, only the last rating would be dropped and the first four would be returned. (In practice I want a minimum rating count higher than 2; this is just for the example.)

At the moment I have code that outputs the product IDs ordered by number of ratings, but I'm not sure whether the main RDD can be filtered based on that, and it looks very inefficient anyway.

1 Answer:

Answer 0 (score: 1)

You can group the RDD by ProductId, then filter the groups by whether their size exceeds a threshold (1 here). Use flatMap to extract the ratings from the grouped RDD:

case class Rating(UserId: Int, ProductId: Int, Rating: Double)

val ratings = sc.parallelize(Seq(Rating(35, 1, 5.0),
    Rating(18, 1, 4.0),
    Rating(29, 2, 3.0),
    Rating(12, 2, 2.0),
    Rating(65, 3, 1.0)))

// Group by product, keep groups with more than one rating, then flatten back to individual ratings
val prodMinCounts = ratings.groupBy(_.ProductId).
                            filter(_._2.size > 1).
                            flatMap(_._2)
prodMinCounts.collect
// res14: Array[Rating] = Array(Rating(35,1,5.0), Rating(18,1,4.0), Rating(29,2,3.0), Rating(12,2,2.0))
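
Since groupBy shuffles every Rating just so each group can be counted, a possible alternative is to count per product with reduceByKey first and then filter the original RDD against the product IDs that pass. The following is only a sketch under assumptions: it uses MLlib's own Rating(user, product, rating) rather than the case class above, runs against a local SparkContext, and assumes the set of surviving product IDs is small enough to collect on the driver and broadcast. The names FilterByRatingCount, keepFrequentProducts and minCount are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

object FilterByRatingCount {
  // Hypothetical helper: keep only ratings whose product has at least minCount ratings.
  def keepFrequentProducts(ratings: RDD[Rating], minCount: Int): RDD[Rating] = {
    // Count ratings per product; reduceByKey merges partial counts before the shuffle.
    val frequent: Set[Int] = ratings
      .map(r => (r.product, 1))
      .reduceByKey(_ + _)
      .filter { case (_, n) => n >= minCount }
      .keys
      .collect()
      .toSet

    // Assumption: the ID set fits in driver memory, so it can be broadcast to executors.
    val frequentBc = ratings.sparkContext.broadcast(frequent)
    ratings.filter(r => frequentBc.value.contains(r.product))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filter-by-rating-count").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val ratings = sc.parallelize(Seq(
        Rating(35, 1, 5.0),
        Rating(18, 1, 4.0),
        Rating(29, 2, 3.0),
        Rating(12, 2, 2.0),
        Rating(65, 3, 1.0)))

      // Print the ratings that survive a minimum of 2 ratings per product.
      keepFrequentProducts(ratings, minCount = 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With the example data and minCount = 2, this keeps the four ratings for products 1 and 2 and drops the single rating for product 3. If the set of product IDs is too large to broadcast, joining ratings.keyBy(_.product) with the filtered counts is a possible fallback.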