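I have a MySQL database with about 1.5 million company records (name, country, and a few other small text fields). I want to flag identical records with a marker: for example, if two companies with the same name are both in the USA, I set a field (match_id; match_no in the code below) to the same integer, say 10, on both, and likewise for every other match. At the moment this takes a very long time (days), and I suspect I am not using MySQL properly. My code is below. Is there a faster way to do this?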
<?php
//Create the table if does not already exist
mysql_query("CREATE TABLE IF NOT EXISTS proj (
    id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    company_id text NOT NULL ,
    company_name varchar(40) NOT NULL ,
    company_name_text varchar(33) NOT NULL,
    company_name_metaphone varchar(19) NOT NULL,
    country varchar(20) NOT NULL ,
    file_id int(2) NOT NULL ,
    thompson_id varchar(11) NOT NULL ,
    match_no int(7) NOT NULL ,
    INDEX(company_name_text))")
    or die ("Couldn't create the table: " . mysql_error());

//********Real script starts********
$countries_searched = array(); //To save record ids already flagged (save time)
$counter = 1; //Flag

//Since the company_names which are same are going to be from the same country so I get all the countries first in the below query and then in the next get all the companies in that country
$sql = "SELECT DISTINCT country FROM proj WHERE country='Canada'";
$result = mysql_query($sql) or die(mysql_error());

while($resultrow = mysql_fetch_assoc($result)) {
    $country = $resultrow['country'];
    $res = mysql_query("SELECT company_name_metaphone, id, company_name_text
        FROM proj
        WHERE country='$country'
        ORDER BY id") or die (mysql_error());

    //Loop through the company records
    while ($row = mysql_fetch_array($res, MYSQL_NUM)) {
        //If record id is already flagged (matched and saved in the countries searched array) don't waste time doing anything
        if ( in_array($row[1], $countries_searched) ) {
            continue;
        }

        if (strlen($row[0]) > 9) {
            $row[0] = substr($row[0],0,9);
            $query = mysql_query("SELECT id FROM proj
                WHERE country='$country'
                AND company_name_metaphone LIKE '$row[0]%'
                AND id<>'$row[1]'") or die (mysql_error());
            while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
                if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
            }
            if(mysql_num_rows($query) > 0) {
                mysql_query("UPDATE proj SET match_no='$counter'
                    WHERE country='$country'
                    AND company_name_metaphone LIKE '$row[0]%'")
                    or die (mysql_error()." ".mysql_errno());
                $counter++;
            }
        }
        else if(strlen($row[0]) > 3) {
            $query = mysql_query("SELECT id FROM proj WHERE country='$country'
                AND company_name_text='$row[2]' AND id<>'$row[1]'")
                or die (mysql_error());
            while ($id = mysql_fetch_array($query, MYSQL_NUM)) {
                if (!in_array($id[0], $countries_searched)) $countries_searched[] = $id[0];
            }
            if(mysql_num_rows($query) > 0) {
                mysql_query("UPDATE proj SET match_no='$counter'
                    WHERE country='$country'
                    AND company_name_text='$row[2]'") or die (mysql_error());
                $counter++;
            }
        }
    }
}
?>
Answer 0 (score: 1)
I would go for a pure SQL solution, something like:
-- metaphone length 4..9: match on the exact name text
SELECT
    GROUP_CONCAT(id SEPARATOR ' '), "name"
FROM proj
WHERE
    LENGTH(company_name_metaphone) <= 9 AND
    LENGTH(company_name_metaphone) > 3
GROUP BY country, UPPER(company_name_text)
HAVING COUNT(*) > 1
UNION
-- metaphone longer than 9: match on its first 9 characters
SELECT
    GROUP_CONCAT(id SEPARATOR ' '), "metaphone"
FROM proj
WHERE
    LENGTH(company_name_metaphone) > 9
GROUP BY country, LEFT(company_name_metaphone, 9)
HAVING COUNT(*) > 1
Then loop over this result and update the rows for the listed IDs.
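That loop could look roughly like the sketch below. It assumes a mysqli connection (the question uses the old mysql_* functions), and the helper names buildMatchUpdates and applyMatches are made up for illustration. One caveat: GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so very large groups may need that session variable raised first.

```php
<?php
// Hypothetical helper: turn the id lists returned by the GROUP_CONCAT query
// (e.g. "12 57 903") into one UPDATE statement per group of matching rows.
function buildMatchUpdates(array $idLists, int $firstMatchNo = 1): array
{
    $statements = [];
    $matchNo = $firstMatchNo;
    foreach ($idLists as $idList) {
        // Split on whitespace and cast to int so only numeric ids reach the SQL.
        $parts = preg_split('/\s+/', trim($idList), -1, PREG_SPLIT_NO_EMPTY);
        $ids = array_map('intval', $parts);
        if (count($ids) < 2) {
            continue; // a group needs at least two rows to count as a match
        }
        $statements[] = sprintf(
            'UPDATE proj SET match_no = %d WHERE id IN (%s)',
            $matchNo,
            implode(', ', $ids)
        );
        $matchNo++;
    }
    return $statements;
}

// Hypothetical driver: $db is an open mysqli connection (an assumption),
// $groupQuery is the UNION query from this answer.
function applyMatches(mysqli $db, string $groupQuery): int
{
    $result = $db->query($groupQuery);
    if ($result === false) {
        throw new RuntimeException($db->error);
    }
    $idLists = [];
    while ($row = $result->fetch_row()) {
        $idLists[] = $row[0]; // first column holds the GROUP_CONCAT id list
    }
    $statements = buildMatchUpdates($idLists);
    foreach ($statements as $sql) {
        if ($db->query($sql) === false) {
            throw new RuntimeException($db->error);
        }
    }
    return count($statements); // number of groups that were flagged
}
```

Because each UPDATE filters on the primary key, it does not have to scan the table the way the LIKE-based updates in the question do.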
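Answer 1 (score: 0)
I'm not sure exactly what you are trying to do, but what I see in your code is that you are searching an array that holds a large amount of data. I think your problem is the PHP code rather than the SQL statements.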
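To illustrate the point about array searching: in_array() scans the array linearly on every call, so each check gets slower as the list of flagged ids grows. A minimal sketch of the usual workaround is to store the ids as array keys and test them with isset(). The variable name $flagged and the sample ids are placeholders, not part of the original script.

```php
<?php
// Sketch: keep flagged ids as array KEYS so membership is a hash lookup
// instead of the linear scan that in_array() performs.
$flagged = [];

// Instead of: if (!in_array($id, $countries_searched)) $countries_searched[] = $id;
foreach ([17, 42, 17, 99] as $id) {   // sample ids, made up for illustration
    $flagged[$id] = true;             // a repeated id just overwrites its own key
}

// Instead of: if (in_array($row[1], $countries_searched)) continue;
$unflagged = [];
foreach ([5, 17, 42, 100] as $rowId) {
    if (isset($flagged[$rowId])) {
        continue;                     // already matched, skip it
    }
    $unflagged[] = $rowId;
}
```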
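Answer 2 (score: 0)
You will need to adjust the GROUP BY fields to fit your matching requirements.
If your script times out (quite likely, given the amount of data), call set_time_limit(0). Alternatively, you can add a LIMIT of 1000 or so to $sql and run the script several times, since the WHERE clause excludes any matching rows that have already been processed (but $match_no is not tracked between calls, so you would need to handle that yourself).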
// find all companies that have multiple rows grouped by identifying fields
$sql = "select company_name, country, COUNT(*) as num_matches from proj
        where match_no = 0
        group by company_name, country
        having num_matches > 1";
$res = mysql_query($sql);
$match_no = 1;
// loop through all duplicate companies, and set match_no
while ($row = mysql_fetch_assoc($res)) {
    $company_name = mysql_real_escape_string($row['company_name']);
    $country = mysql_real_escape_string($row['country']);
    // flag every row that shares both identifying fields
    $sql = "update proj set match_no = $match_no where
            company_name = '$company_name' and country = '$country'";
    mysql_query($sql);
    $match_no++;
}
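For the batched variant, the counter has to survive between runs. One way to handle that, sketched here under the assumption of a mysqli connection and with made-up helper names (nextMatchNo, startingMatchNo), is to start each run one above the highest match_no already stored in the table:

```php
<?php
// Hypothetical helper: pick the first flag for this run from the highest
// match_no already stored, so repeated batched runs never reuse a number.
function nextMatchNo(?int $highestStored): int
{
    // null means the table is empty; unmatched rows hold 0.
    return ($highestStored ?? 0) + 1;
}

// Hypothetical driver, assuming $db is an open mysqli connection.
function startingMatchNo(mysqli $db): int
{
    $result = $db->query('SELECT MAX(match_no) FROM proj');
    if ($result === false) {
        throw new RuntimeException($db->error);
    }
    $row = $result->fetch_row();
    // MAX() yields NULL when the table has no rows.
    return nextMatchNo($row[0] === null ? null : (int) $row[0]);
}
```

With this in place, $match_no = 1 in the script above would become $match_no = startingMatchNo($db).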