What is the best-performing algorithm for finding duplicates?

Time: 2018-07-26 05:02:19

Tags: php mysql codeigniter

Here is my code:

    function checkForDuplicates() {
        $data = $this->input->post();
        $project_id = $data['project_id'];

        $this->db->where('project_id', $project_id);
        $paper = $this->db->get('paper')->result();

        $paper2 = $paper; // duplicate the array of papers
        $duplicatesCount = 0;

        foreach ($paper as $p) {
            $similarity = null;

            foreach ($paper2 as $p2) {
                if ($p->status_selection_id !== 4 && $p2->status_selection_id !== 4) {
                    if ($p->paper_id !== $p2->paper_id) {
                        similar_text($p->title, $p2->title, $similarity);

                        if ($similarity > 90) {
                            $p->status_selection_id = 4;
                            $this->db->where('paper_id', $p->paper_id);
                            $this->db->update('paper', $p);
                            $duplicatesCount++;
                        }
                    }
                }
            }
        }

        $data = array(
            'duplicatesCount' => $duplicatesCount,
            'message' => 'Duplicates were found!'
        );
        echo json_encode($data);
    }

  1. similar_text takes 180 seconds to check 1500 records.
  2. levenshtein takes 101 seconds to check 1500 records.
  3. if ($pp1 === $pp2) takes 45 seconds to check 1500 records.

What is the fastest way to check for duplicate records and change their status?

1 Answer:

Answer 0: (score: 1)

Optimization usually comes down to reducing I/O.

In your case, reducing the number of SQL queries should shorten the processing time.

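If you need to process a large number of records, split them into chunks. Each chunk should contain a batch of records that fits in memory (RAM).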
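As a rough illustration of the chunking idea, here is a minimal sketch. The helper name `fetchInChunks` and the `$fetchPage` callback are hypothetical, and an in-memory array stands in for the database so that the example runs on its own.

```php
<?php
// Hypothetical helper (not part of CodeIgniter): yields records in chunks of
// at most $chunkSize. $fetchPage receives ($limit, $offset) and must return
// an array of rows for that page.
function fetchInChunks(callable $fetchPage, int $chunkSize): Generator
{
    $offset = 0;
    do {
        $chunk = $fetchPage($chunkSize, $offset);
        if (count($chunk) > 0) {
            yield $chunk;
        }
        $offset += $chunkSize;
        // A page shorter than $chunkSize means there is nothing left to read.
    } while (count($chunk) === $chunkSize);
}

// Stand-in for the database: ten fake rows, read four at a time.
$rows = range(1, 10);
$fetchPage = function (int $limit, int $offset) use ($rows): array {
    return array_slice($rows, $offset, $limit);
};

foreach (fetchInChunks($fetchPage, 4) as $i => $chunk) {
    echo "chunk $i: " . count($chunk) . " records\n";
}
```

In a CodeIgniter model the callback would presumably wrap a query such as `$this->db->limit($limit, $offset)->get('paper')->result()`. Keep in mind that a pairwise comparison only sees the records inside the current chunk.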

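Retrieve a chunk from the database. Process the chunk (i.e. loop over it), and use an array to keep track of the changes that need to be made in the database. Finally, update the database in bulk, using as few queries as possible.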

    $data = $this->input->post();
    $project_id = $data['project_id'];

    $this->db->where('project_id', $project_id);
    $paper = $this->db->get('paper')->result();

    $paper2 = $paper; // duplicate the array of papers
    $duplicatesCount = 0;

    // keep track of updates
    $updates = [];

    foreach ($paper as $p) {
        $similarity = null;

        foreach ($paper2 as $p2) {
            if ($p->status_selection_id !== 4 && $p2->status_selection_id !== 4) {
                if ($p->paper_id !== $p2->paper_id) {
                    similar_text($p->title, $p2->title, $similarity);

                    if ($similarity > 90) {
                        // mark the paper in memory so it is not queued again
                        $p->status_selection_id = 4;

                        $updates[] = [
                            'paper_id' => $p->paper_id,
                            'status_selection_id' => 4,
                        ];

                        $duplicatesCount++;
                    }
                }
            }
        }
    }

    if ($duplicatesCount > 0) {
        // Send all the updates in one batched request instead of one query per row.
        // CodeIgniter's Query Builder provides update_batch() for this;
        // the third argument is the column used to match the rows.
        $this->db->update_batch('paper', $updates, 'paper_id');
    }
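
The collect-then-update idea can also be tried outside the framework. The sketch below is a variation, not the code above: the helper name `collectDuplicateUpdates` is hypothetical, it compares each pair of papers only once, and it flags the later paper of a similar pair. It assumes each paper is an object with `paper_id`, `title` and `status_selection_id` properties, as in the question.

```php
<?php
// Hypothetical helper: returns one update row for each paper whose title is
// more than $threshold percent similar to an earlier, non-duplicate paper.
// Flagged papers are also marked in memory so they are skipped afterwards.
function collectDuplicateUpdates(array $papers, float $threshold = 90.0): array
{
    $papers = array_values($papers); // make sure the indexes run 0..n-1
    $count = count($papers);
    $updates = [];

    for ($i = 0; $i < $count; $i++) {
        // Cast to int: database drivers often return column values as strings.
        if ((int) $papers[$i]->status_selection_id === 4) {
            continue;
        }
        // Starting at $i + 1 compares each pair of papers exactly once.
        for ($j = $i + 1; $j < $count; $j++) {
            if ((int) $papers[$j]->status_selection_id === 4) {
                continue;
            }
            similar_text($papers[$i]->title, $papers[$j]->title, $percent);
            if ($percent > $threshold) {
                $papers[$j]->status_selection_id = 4;
                $updates[] = [
                    'paper_id' => $papers[$j]->paper_id,
                    'status_selection_id' => 4,
                ];
            }
        }
    }

    return $updates;
}

// Made-up sample data for illustration only.
$papers = [
    (object) ['paper_id' => 1, 'title' => 'Deep learning for image recognition', 'status_selection_id' => 1],
    (object) ['paper_id' => 2, 'title' => 'Deep learning for image recognition.', 'status_selection_id' => 1],
    (object) ['paper_id' => 3, 'title' => 'A survey of graph databases', 'status_selection_id' => 1],
];

$updates = collectDuplicateUpdates($papers);
echo count($updates) . " duplicate(s) queued for update\n";
```

In a CodeIgniter controller, the returned array could then be passed to `$this->db->update_batch('paper', $updates, 'paper_id')` when it is not empty.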