Question

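I have a MySQL database with about 1.5 million company records (name, country, and other small text fields). I want to mark identical records with a flag: for example, if two companies with the same name are both in the USA, I set a field (match_id) on both to the same integer, say 10, and likewise for other matches. At the moment this takes a very long time (days), and I feel I am not using MySQL properly. I have posted my code below; is there a faster way to do this?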

<?php

//Create the table if it does not already exist
mysql_query("CREATE TABLE IF NOT EXISTS proj ( 
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
  company_id text NOT NULL ,
  company_name varchar(40) NOT NULL ,
  company_name_text varchar(33) NOT NULL,
  company_name_metaphone varchar(19) NOT NULL,
  country varchar(20) NOT NULL ,
  file_id int(2) NOT NULL ,
  thompson_id varchar(11) NOT NULL ,
  match_no int(7) NOT NULL ,
  INDEX(company_name_text))") 
  or die ("Couldn't create the table: " . mysql_error());


//********Real script starts********
$countries_searched = array(); //To save record ids already flagged (save time)
$counter = 1; //Flag

//Since companies with the same name are going to be from the same country, I get all the countries first in the query below and then, in the next query, all the companies in that country
$sql = "SELECT DISTINCT country FROM proj WHERE country='Canada'";
$result = mysql_query($sql) or die(mysql_error());

while($resultrow = mysql_fetch_assoc($result)) {
  $country = $resultrow['country'];
  $res = mysql_query("SELECT company_name_metaphone, id, company_name_text 
  FROM proj 
  WHERE country='$country' 
  ORDER BY id") or die (mysql_error());


  //Loop through the company records 
  while ($row = mysql_fetch_array($res, MYSQL_NUM)) {

    //If the record id is already flagged (matched and saved in the $countries_searched array), don't waste time doing anything
    if ( in_array($row[1], $countries_searched) ) {
      continue;
    }

    if (strlen($row[0]) > 9) {
      $row[0] = substr($row[0],0,9);
      $query = mysql_query("SELECT id FROM proj 
        WHERE country='$country' 
        AND company_name_metaphone LIKE '$row[0]%' 
        AND id<>'$row[1]'") or die (mysql_error());

      while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
        if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
      }
      if(mysql_num_rows($query) > 0) {

        mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_metaphone LIKE '$row[0]%'") 
          or die (mysql_error()." ".mysql_errno());
        $counter++;
      }
    }
    else if(strlen($row[0]) > 3) {
      $query = mysql_query("SELECT id FROM proj WHERE country='$country' 
               AND company_name_text='$row[2]' AND id<>'$row[1]'") 
        or die (mysql_error());
      while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
        if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
      }
      if(mysql_num_rows($query) > 0) {
        mysql_query("UPDATE proj SET match_no='$counter' 
                    WHERE country='$country' 
                    AND company_name_text='$row[2]'") or die (mysql_error());
        $counter++;
      }
    }   
  }
}
?>

Answer 1

I would go for a pure SQL solution, for example:

-- Short metaphones (4 to 9 characters): match on the exact name text
SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), 'name'
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) <= 9 AND 
    LENGTH(company_name_metaphone) > 3
GROUP BY country, UPPER(company_name_text)
HAVING COUNT(*) > 1
UNION
-- Long metaphones (more than 9 characters): match on the first 9 characters
SELECT 
    GROUP_CONCAT(id SEPARATOR ' '), 'metaphone'
FROM proj 
WHERE 
    LENGTH(company_name_metaphone) > 9
GROUP BY country, LEFT(company_name_metaphone, 9)
HAVING COUNT(*) > 1

Then loop over this result to update the IDs.
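A minimal sketch of that loop, assuming the first column of each result row has already been collected into a PHP array of space-separated id strings. `buildMatchUpdates` is a hypothetical helper, not something from the answer above; it only builds the UPDATE statements, so it can be tried without a database. Keep in mind that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so very large groups may need that setting raised.

```php
<?php
// Hypothetical helper: turn grouped id strings ("3 17 42") into one
// UPDATE statement per group, numbering the groups consecutively.
function buildMatchUpdates(array $groupedIds, int $firstMatchNo = 1): array
{
    $statements = [];
    $matchNo = $firstMatchNo;

    foreach ($groupedIds as $idList) {
        // Cast every id to int so only digits end up in the SQL text.
        $ids = array_map('intval', preg_split('/\s+/', trim($idList)));

        // A group with fewer than two ids is not a match; skip it.
        if (count($ids) < 2) {
            continue;
        }

        $statements[] = sprintf(
            'UPDATE proj SET match_no = %d WHERE id IN (%s)',
            $matchNo,
            implode(', ', $ids)
        );
        $matchNo++;
    }

    return $statements;
}

// Made-up ids, standing in for the first column of the query result.
$statements = buildMatchUpdates(['1 5 9', '2 7']);
foreach ($statements as $statement) {
    echo $statement, "\n";
}
```

This issues one UPDATE per group of duplicates rather than several queries per row, and each UPDATE hits the primary key.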

Answer 2

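I am not sure exactly what you are trying to do, but what I see in your code is that you search an array holding a large amount of data. I think your problem is the PHP code rather than the SQL statements.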
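To illustrate the point about array searches: in_array() scans the whole list on every call, whereas using the ids as array keys lets isset() do a hash lookup instead. A small sketch with made-up ids; the function and variable names are hypothetical.

```php
<?php
// Slow variant, as in the question: a plain list searched linearly each time.
function isFlaggedList(array $list, int $id): bool
{
    return in_array($id, $list, true);
}

// Faster variant: the ids are the array keys, so the lookup is a hash probe.
function isFlaggedSet(array $set, int $id): bool
{
    return isset($set[$id]);
}

$list = [];
$set = [];
foreach ([10, 11, 12] as $id) {   // made-up ids for illustration
    $list[] = $id;                // what the question's code does
    $set[$id] = true;             // keyed alternative; duplicates collapse by themselves
}

var_dump(isFlaggedList($list, 11), isFlaggedSet($set, 11), isFlaggedSet($set, 99));
```

With hundreds of thousands of flagged ids, the linear scan runs once per row and grows with the list, while the keyed lookup stays roughly constant.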

Answer 3

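You will need to adjust the GROUP BY fields to suit your matching requirements.

If your script times out (quite likely given the amount of data), call set_time_limit(0). Alternatively, add a LIMIT of 1000 or so to $sql and run the script several times; the WHERE clause will exclude any matched rows that have already been processed. Note that $match_no is not tracked between runs, so you will need to handle that yourself.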

// Find all companies that have multiple rows, grouped by the identifying fields
$sql = "SELECT company_name, country, COUNT(*) AS num_matches
        FROM proj
        WHERE match_no = 0
        GROUP BY company_name, country
        HAVING num_matches > 1";

$res = mysql_query($sql) or die(mysql_error());

$match_no = 1;

// Loop through all duplicate companies and set match_no
while ($row = mysql_fetch_assoc($res)) {

  // Escape the values before putting them back into SQL
  $company_name = mysql_real_escape_string($row['company_name']);
  $country = mysql_real_escape_string($row['country']);

  $sql = "UPDATE proj SET match_no = $match_no
          WHERE company_name = '$company_name' AND country = '$country'";

  mysql_query($sql) or die(mysql_error());

  $match_no++;
}
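The batching idea from the note above could look roughly like this. Both helpers are hypothetical and only prepare values, so they run without a database; the assumption is that each run first reads SELECT MAX(match_no) FROM proj and continues numbering from there.

```php
<?php
// Hypothetical helper: the duplicate-finding query, limited to one batch.
function buildBatchQuery(int $batchSize): string
{
    return "SELECT company_name, country, COUNT(*) AS num_matches
FROM proj
WHERE match_no = 0
GROUP BY company_name, country
HAVING num_matches > 1
LIMIT " . max(1, $batchSize);
}

// Hypothetical helper: first free match number for this run, given the
// value returned by SELECT MAX(match_no) FROM proj (NULL on an empty table).
function nextMatchNo($currentMax): int
{
    return (int) $currentMax + 1;
}

echo buildBatchQuery(1000), "\n";
echo nextMatchNo('41'), "\n";
```

Starting each run at the current maximum plus one keeps match numbers unique across runs, which is the part the script above leaves to the caller.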

MySQL: matching text fields
