Question

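I have a dataset with 28k users, 60k locations and 1m reviews. I am implementing a recommender system that takes into account the locations users have in common and the ratings they have in common, and predicts whether a user would want to go to a given location.

Here is the code showing how I do it: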

HashMap<String, HashMap<String, Double>> user_locIDVisitsPredictions = new HashMap<>();
HashMap<String, HashMap<String, Double>> user_locIDRatesPredictions = new HashMap<>();          


List<Future> tasks1 = new ArrayList<>();
ExecutorService executor1 = Executors.newFixedThreadPool(threads);
for(String me : wholeSetHistory.keySet()){
    Runnable tokentask = new UserRun(wholeSetHistory, wholeSetRatings, lnglatStores2, user_locIDVisitsPredictions, user_locIDRatesPredictions, me, u);
    u++;
    tasks1.add(executor1.submit(tokentask));
}
executor1.shutdown();
boolean done1=false;
while(done1==false) {
    done1=true;
    for (int i=0; i<tasks1.size(); i++){
        try{
            Future future =tasks1.get(i);
            if(future.get()!=null){
                done1=false;
                break;
            }
        }catch(Exception e){
            System.out.println("sto future kollise ");
        }
    }
}
tasks1.clear();

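The Runnable tokentask runs the process for one user and produces that user's results. I am using threads because I run the experiments on a single machine, which runs Linux. I start it with nohup.

Now my problem. The process runs smoothly until it reaches about 25k users. For the last 3k users, computing the results becomes very slow.

Some more details on how the algorithm works.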

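1. Take a target user.
2. For every other user near the target user:

2.1 Get the locations he has visited and compare them with the target user's.

2.2 Get the ratings he has given and compare them with the target user's.

2.3 Compute the similarities.

2.4 Make the predictions.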
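
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the question does not say which similarity measure or prediction rule is used, so the Jaccard overlap of visited locations and the similarity-weighted score below are assumptions, and every class and method name is hypothetical. The ratings comparison (step 2.2) and the "near the target user" filter are left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NeighbourPrediction {

    // Steps 2.1 and 2.3: Jaccard overlap of two users' visited locations
    // (an assumed similarity measure, not taken from the question).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    // Steps 2 and 2.4: score every location the target has not visited by the
    // summed similarity of the other users who did visit it.
    static Map<String, Double> predictVisits(String target, Map<String, Set<String>> history) {
        Set<String> mine = history.get(target);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Set<String>> other : history.entrySet()) {
            if (other.getKey().equals(target)) continue;   // skip the target itself
            double sim = jaccard(mine, other.getValue());
            if (sim == 0.0) continue;                      // nothing in common
            for (String loc : other.getValue()) {
                if (!mine.contains(loc)) scores.merge(loc, sim, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> history = new HashMap<>();
        history.put("u1", Set.of("a", "b"));
        history.put("u2", Set.of("a", "b", "c"));
        history.put("u3", Set.of("d"));
        System.out.println(predictVisits("u1", history));
    }
}
```

Note that comparing each target user against every other user is quadratic in the number of users, so the cost per target depends heavily on how many neighbours and locations that user has.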

Any ideas why this process becomes very slow after 25k users?

Thanks for your time!

Answer 1

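I would use fewer tasks. I am guessing wholeSetHistory is a collection of 1m entries, but you only need 1 to 2 tasks per CPU. Fortunately, there is a built-in library that can do this for you.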

// Assumes UserRun exposes a result() method that returns a Result for one user.
List<Result> results = wholeSetHistory.entrySet().parallelStream()
    .map(e -> new UserRun(e.getValue(), wholeSetRatings, lnglatStores2,
                          user_locIDVisitsPredictions, user_locIDRatesPredictions, e.getKey())
                 .result())
    .collect(Collectors.toList());

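This greatly reduces the number of objects you create at once, and the code is much simpler. You also no longer have a thread busy-waiting and burning CPU.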
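
For reference, here is a minimal self-contained sketch of the same idea that can be run as is. The class ParallelUsers and the method scoreUser are hypothetical stand-ins for UserRun and its per-user work. One further assumption: a plain HashMap is not safe to write to from several threads, so instead of mutating shared maps this sketch has each task return its value and lets the collector build a concurrent map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelUsers {

    // Hypothetical stand-in for the per-user computation done by UserRun.
    static double scoreUser(String user, int visits) {
        return visits * 0.5;
    }

    // Computes one result per user on the common fork-join pool. The stream
    // splits the entries into chunks across the pool's worker threads, so no
    // per-user Runnable or Future is created and no polling loop is needed.
    static Map<String, Double> scoreAll(Map<String, Integer> history) {
        return history.entrySet().parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Map.Entry::getKey,
                        e -> scoreUser(e.getKey(), e.getValue())));
    }

    public static void main(String[] args) {
        Map<String, Integer> history = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            history.put("user" + i, i);
        }
        Map<String, Double> results = scoreAll(history);
        System.out.println(results.size() + " users scored");
    }
}
```

The call to collect only returns once every entry has been processed, which replaces the manual loop over the futures.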
