Question

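I wrote a script that imports a large number of images from magazines into a database. I want each magazine to contain only unique images, so every time I import one I need to check whether that image already exists in my database. My question: how can I do this quickly when the database holds ~1,000,000 records?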

My idea is to use strlen() on each image:

$image = file_get_contents('http://server.com/imageX.jpg');
$counter = strlen($image);
// $counter => for example: 105188

Then I save this number in the database and use INSERT IGNORE INTO:

INSERT IGNORE INTO `database` (`unique_counter`, `img_url`, `img_name`) VALUES (105188, 'http://server.com/imageX.jpg', 'imageX.jpg')

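If the image gets added, everything is fine. But I think this idea only holds up for ~100 images. With 1,000,000 images or more, all with similar dimensions (width and height), it breaks down: two different images can end up with the same strlen() value.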
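The collision worry is easy to reproduce. Here is a minimal sketch, using two made-up byte strings in place of real image data (the variable names are only for illustration): both have the same length, so strlen() cannot tell them apart, while a content hash such as md5() can.

```php
<?php
// Two different payloads of identical length (stand-ins for real image bytes).
$imageA = str_repeat("\x00", 1024);
$imageB = str_repeat("\xFF", 1024);

// Same byte count, so a length-based key would treat them as duplicates.
$sameLength = strlen($imageA) === strlen($imageB);

// Different content hashes, so a hash-based key keeps them apart.
$sameHash = md5($imageA) === md5($imageB);

var_dump($sameLength); // bool(true)
var_dump($sameHash);   // bool(false)
```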

Can you help? How can I compare many images in the database in a very short time?

Thanks.

Answer 1

You should create a hash for each image and store it in the database.

For smaller files you can get the hash with $hash = md5_file($file_path);
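To tie the hash back to the INSERT IGNORE approach from the question, here is a rough sketch. It assumes an in-memory SQLite database through PDO as a stand-in for the real MySQL server, a hypothetical images table, and a hypothetical helper insert_if_new(); it hashes a string with md5() where a real import would call md5_file() on the downloaded file. The duplicate check is done by the UNIQUE constraint on the hash column.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real MySQL server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical table; the UNIQUE constraint on the hash is what rejects duplicates.
$pdo->exec('CREATE TABLE images (
    img_hash CHAR(32) NOT NULL UNIQUE,
    img_url  TEXT NOT NULL,
    img_name TEXT NOT NULL
)');

/**
 * Insert an image row unless a row with the same content hash already exists.
 * Returns true when a new row was written, false when it was a duplicate.
 */
function insert_if_new(PDO $pdo, string $bytes, string $url, string $name): bool
{
    // SQLite spells it INSERT OR IGNORE; on MySQL the statement would be INSERT IGNORE.
    $stmt = $pdo->prepare(
        'INSERT OR IGNORE INTO images (img_hash, img_url, img_name) VALUES (?, ?, ?)'
    );
    $stmt->execute([md5($bytes), $url, $name]);

    return $stmt->rowCount() === 1;
}

// The same bytes under two URLs: only the first insert creates a row.
$first  = insert_if_new($pdo, 'fake image bytes', 'http://server.com/imageX.jpg', 'imageX.jpg');
$second = insert_if_new($pdo, 'fake image bytes', 'http://server.com/copy.jpg', 'copy.jpg');
$count  = (int) $pdo->query('SELECT COUNT(*) FROM images')->fetchColumn();
```

Because the uniqueness check happens inside the index lookup, the cost per insert should stay roughly the same whether the table has a hundred rows or a million.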

If you have very large images, you can compute the hash without running into the memory limit:

function get_hash($file_path, $limit = 0, $offset = 0) {

    if (filesize($file_path) < 15728640) { //get hash for less than 15MB images
        // md5_file is always faster if we don't chunk the file
        $hash = md5_file($file_path);

        return $hash !== false ? $hash : null;
    }

    $ctx = hash_init('md5');

    if (!$ctx) {
        // Fail to initialize file hashing
        return null;
    }

    // Hash to the end of the file unless the caller passed an explicit $limit
    if ($limit <= 0) {
        $limit = filesize($file_path) - $offset;
    }

    $handle = @fopen($file_path, "rb");
    if ($handle === false) {
        // Failed opening file, cleanup hash context
        hash_final($ctx);

        return null;
    }

    fseek($handle, $offset);

    while ($limit > 0) {
        // Limit chunk size to either our remaining chunk or max chunk size
        $chunkSize = $limit < 131072 ? $limit : 131072;
        $limit -= $chunkSize;

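        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            // Read error: close the handle and discard the hash context
            fclose($handle);
            hash_final($ctx);

            return null;
        }
        hash_update($ctx, $chunk);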
    }

    fclose($handle);

    return hash_final($ctx);
}
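The chunked branch relies on the fact that feeding a file to hash_update() piece by piece yields the same digest as hashing it in one call. A minimal sketch of that equivalence follows; the temporary file of random bytes, its size, and the variable names are assumptions made for illustration.

```php
<?php
// Write a throwaway file so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($path, random_bytes(300000));

// One-shot hash of the whole file.
$oneShot = md5_file($path);

// Incremental hash: feed the file to the context 128 KB at a time.
$ctx = hash_init('md5');
$handle = fopen($path, 'rb');
while (!feof($handle)) {
    $chunk = fread($handle, 131072);
    if ($chunk === false) {
        throw new RuntimeException('Failed to read from ' . $path);
    }
    hash_update($ctx, $chunk);
}
fclose($handle);
$chunked = hash_final($ctx);

// Remove the temporary file again.
unlink($path);

var_dump($oneShot === $chunked); // bool(true)
```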

Answer 2

$info = getimagesize('http://server.com/imageX.jpg');

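$info['time'] = time(); // You can add microtime if needed. Note: the timestamp makes the hash differ between imports made at different times, even for the same image.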

$hash = base64_encode(json_encode($info));

INSERT IGNORE INTO `database` (`hash`, `img_url`, `img_name`) VALUES ('$hash', 'http://server.com/imageX.jpg', 'imageX.jpg')

PHP / SQL - Comparing many images
