unit
id fir_name sec_name
author
id name unit_id
author_paper
id author_id paper_id
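I want to unify authors ("the same author" means the name is the same and the unit's fir_name is the same), and I have to update the author_paper table at the same time.
Here is what I did: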
$conn->do('create index author_name on author (name)');
my $sqr = $conn->prepare("select name from author group by name having count(*) > 1");
$sqr->execute();
while(my @row = $sqr->fetchrow_array()) {
    my $dup_name = $row[0];
    $dup_name = formatHtml($dup_name);
    my $sqr2 = $conn->prepare("select id, unit_id from author where name = '$dup_name'");
    $sqr2->execute();
    my %fir_name_hash = ();
    while(my @row2 = $sqr2->fetchrow_array()) {
        my $author_id = $row2[0];
        my $unit_id = $row2[1];
        my $fir_name = getFirNameInUnit($conn, $unit_id);
        if (not exists $fir_name_hash{$fir_name}) {
            $fir_name_hash{$fir_name} = []; #anonymous arr reference
        }
        $x = $fir_name_hash{$fir_name};
        push @$x, $author_id;
    }
    while(my ($fir_name, $author_id_arr) = each(%fir_name_hash)) {
        my $count = scalar @$author_id_arr;
        if ($count == 1) {next;}
        my $author_id = $author_id_arr->[0];
        for ($i = 1; $i < $count; $i++) {
            #print "$author_id_arr->[$i] => $author_id\n";
            unifyAuthorAndAuthorPaperTable($conn, $author_id, $author_id_arr->[$i]); #just delete in author table, and update in author_paper table
        }
    }
}
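select count(*) from author; # 240,000
select count(distinct(name)) from author; # 77,000
It is very slow! I have been running it for 5 hours and it has only removed about 40,000 duplicate names. How can I make it run faster? I am eager for your advice.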
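Answer 0 (score: 8)
You shouldn't prepare the second SQL statement inside the loop, and you can actually take advantage of preparing it once when you use ? placeholders: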
$conn->do('create index author_name on author (name)');
my $sqr = $conn->prepare('select name from author group by name having count(*) > 1');
# ? is the placeholder; the database driver knows whether the value is an
# integer or a string and quotes the input if needed.
my $sqr2 = $conn->prepare('select id, unit_id from author where name = ?');
$sqr->execute();
while (my @row = $sqr->fetchrow_array()) {
    my $dup_name = $row[0];
    $dup_name = formatHtml($dup_name);
    # Now you can reuse the prepared handle with different input
    $sqr2->execute($dup_name);
    my %fir_name_hash = ();
    while (my @row2 = $sqr2->fetchrow_array()) {
        my $author_id = $row2[0];
        my $unit_id = $row2[1];
        my $fir_name = getFirNameInUnit($conn, $unit_id);
        # push autovivifies the anonymous array on first use
        push @{ $fir_name_hash{$fir_name} }, $author_id;
    }
    while (my ($fir_name, $author_id_arr) = each(%fir_name_hash)) {
        my $count = scalar @$author_id_arr;
        if ($count == 1) {next;}
        my $author_id = $author_id_arr->[0];
        for (my $i = 1; $i < $count; $i++) {
            #print "$author_id_arr->[$i] => $author_id\n";
            unifyAuthorAndAuthorPaperTable($conn, $author_id, $author_id_arr->[$i]); #just delete in author table, and update in author_paper table
        }
    }
}
This may also speed things up.
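Answer 1 (score: 5)
When I see a query and a loop, I think you have a latency problem: you query to get a set of values and then iterate over the set to do something else. If that means a database round trip for each row in the set, that's a lot of latency.
It would be better if you could do this in a single query using an UPDATE and a sub-select, or if you could batch the requests and execute them all in one round trip.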
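As a rough sketch of the UPDATE-with-sub-select idea, the whole merge could be expressed as a few set-based statements. The helper name unify_authors_set_based and the scratch table author_merge are made up for illustration; the SQL is kept generic and may need adjusting for your database's dialect. It assumes unit.id is unique and, unlike the original loop, always keeps the smallest id in each group.

```perl
use strict;
use warnings;

# Hypothetical helper: merges duplicate authors with a handful of set-based
# statements instead of one round trip per duplicate row.
# $dbh is an already connected DBI handle (assumed to have RaiseError set).
sub unify_authors_set_based {
    my ($dbh) = @_;

    # 1. Map every duplicate author id to the id that is kept: the smallest
    #    id among authors sharing the same name and the same unit fir_name.
    #    author_merge is a made-up scratch table name.
    $dbh->do(q{
        create table author_merge as
        select a.id as dup_id, k.keep_id
        from author a
        join unit u on u.id = a.unit_id
        join (
            select a2.name, u2.fir_name, min(a2.id) as keep_id
            from author a2
            join unit u2 on u2.id = a2.unit_id
            group by a2.name, u2.fir_name
        ) k on k.name = a.name and k.fir_name = u.fir_name
        where a.id <> k.keep_id
    });

    # 2. Repoint all papers of the duplicates in one UPDATE with a sub-select.
    $dbh->do(q{
        update author_paper
        set author_id = (select m.keep_id from author_merge m
                         where m.dup_id = author_paper.author_id)
        where author_id in (select dup_id from author_merge)
    });

    # 3. Delete the duplicate authors, then drop the scratch table.
    $dbh->do(q{delete from author where id in (select dup_id from author_merge)});
    $dbh->do(q{drop table author_merge});

    return;
}
```

You would call it as unify_authors_set_based($conn). Try it on a copy of the data first: after the merge, author_paper can contain the same (author_id, paper_id) pair twice if two duplicates were linked to the same paper.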
You'll get an additional speedup if you use indexes wisely. Every column in a WHERE clause should have an index. Every foreign key should have an index.
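For the schema in the question, that advice might translate into something like the following sketch. The question only shows an index on author (name); whether the foreign-key columns are already indexed is unknown, so the two indexes here are an assumption, and the helper name and index names are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: indexes the foreign-key columns that the dedup job
# filters and joins on. $dbh is an already connected DBI handle.
sub create_dedup_indexes {
    my ($dbh) = @_;

    # index name (made up) => table and column to index
    my %index_on = (
        author_unit_id         => 'author (unit_id)',
        author_paper_author_id => 'author_paper (author_id)',
    );

    for my $index_name (sort keys %index_on) {
        $dbh->do("create index $index_name on $index_on{$index_name}");
    }
    return;
}
```

The index on author_paper (author_id) matters most if unifyAuthorAndAuthorPaperTable updates author_paper by author_id, since without it every call would have to scan the whole table.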
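I'd run EXPLAIN PLAN on your queries and see whether any TABLE SCANs are taking place. If there are, you have to index properly.
I wonder whether a properly designed JOIN would come to your rescue?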
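One way a JOIN could help, sketched under the assumption that the whole author table fits comfortably in memory: fetch id, name and fir_name in a single joined query, then work out which ids to merge in Perl. The names $JOINED_SQL and build_merge_map are invented for this example, and the sample rows are made-up data.

```perl
use strict;
use warnings;

# One joined query that would replace the per-name and per-unit lookups.
my $JOINED_SQL = q{
    select a.id, a.name, u.fir_name
    from author a
    join unit u on u.id = a.unit_id
    order by a.id
};

# Given rows of [author_id, name, fir_name], e.g. from
# $conn->selectall_arrayref($JOINED_SQL), return { dup_id => keep_id }.
# The first id seen for each (name, fir_name) pair is the one kept.
sub build_merge_map {
    my ($rows) = @_;
    my (%keep_id_for, %merge_into);
    for my $row (@$rows) {
        my ($author_id, $name, $fir_name) = @$row;
        my $key = join "\0", $name, $fir_name;
        if (exists $keep_id_for{$key}) {
            $merge_into{$author_id} = $keep_id_for{$key};
        } else {
            $keep_id_for{$key} = $author_id;
        }
    }
    return \%merge_into;
}

# Usage with hard-coded sample rows, so no database is needed:
my $map = build_merge_map([
    [1, 'Li Wei', 'Physics'],
    [2, 'Li Wei', 'Physics'],
    [3, 'Li Wei', 'Chemistry'],
    [4, 'Li Wei', 'Physics'],
]);
print "$_ => $map->{$_}\n" for sort { $a <=> $b } keys %$map;
# prints "2 => 1" and "4 => 1"
```

Each pair in the map could then be passed to unifyAuthorAndAuthorPaperTable, so the only remaining round trips are the updates themselves.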
240,000 rows in one table and 77,000 in another is not that large a database.