Grouping an array by key-value similarity

Time: 2018-01-09 18:54:52

Tags: php arrays laravel

Suppose I have an array like this:

$data[0]['name'] = 'product 1 brandX';
$data[0]['id_product'] = '77777777';
$data[1]['name'] = 'brandX product 1';
$data[1]['id_product'] = '77777777';
$data[2]['name'] = 'brandX product 1 RED';
$data[2]['id_product'] = '77777777';
$data[3]['name'] = 'product 1 brandX';
$data[3]['id_product'] = '';
$data[4]['name'] = 'product 2 brandY';
$data[4]['id_product'] = '8888888';
$data[5]['name'] = 'product 2 brandY RED';
$data[5]['id_product'] = '';

I am trying to group the entries by similarity (of either name or id_product).

This would be the expected final array:

$uniques[0]['name'] = 'product 1 brandX'; //The smallest name for the product
$uniques[0]['count'] = 4; //Entry which has all the words of the smallest name or the same id_product
$uniques[1]['name'] = 'product 2 brandY';
$uniques[1]['count'] = 2;

This is what I have done so far:

foreach ($data as $t) {
    if (!isset($uniques[$t['id_product']]['name']) || mb_strlen($uniques[$t['id_product']]['name']) > mb_strlen($t['name'])) {
        $uniques[$t['id_product']]['name'] = $t['name'];
        $uniques[$t['id_product']]['count']++;
    }
}

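But I cannot rely on id_product alone, because sometimes two entries are the same product but one has the ID and the other does not. I also have to check the name, and I have not been able to get that part working.

3 answers:

Answer 0 (score: 0)

I don't think this will solve your problem, but it may get you moving again: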

    $data = [];

    $data[0]['name']       = 'product 1 brandX';
    $data[0]['id_product'] = '77777777';
    $data[1]['name']       = 'brandX product 1';
    $data[1]['id_product'] = '77777777';
    $data[2]['name']       = 'brandX product 1 RED';
    $data[2]['id_product'] = '77777777';
    $data[3]['name']       = 'product 1 brandX';
    $data[3]['id_product'] = '';
    $data[4]['name']       = 'product 2 brandY';
    $data[4]['id_product'] = '8888888';
    $data[5]['name']       = 'product 2 brandY RED';
    $data[5]['id_product'] = '';

    $data = collect($data);

    $tallies = [
        'brand_x' => 0,
        'brand_y' => 0,
        'other'   => 0
    ];

    $unique = $data->unique(function ($item) use (&$tallies){
        switch(true){
            case(strpos($item['name'], 'brandX') !== false):
                $tallies['brand_x']++;

                return 'product X';
                break;

            case(strpos($item['name'], 'brandY') !== false):
                $tallies['brand_y']++;

                return 'product Y';
                break;

            default:
                $tallies['other']++;

                return 'other';
                break;
        }
    });


    print_r($unique);
    print_r($tallies);

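Answer 1 (score: 0)

I think the best way to solve this is to rely on a unique product ID. However, if you want to build a unique key by looking for similarity in the name field, you can use preg_split to turn each name into an array of words and then array_diff to find the words that differ. If the difference contains fewer than 2 words, the two names are treated as the same product. I wrote this function, which returns the similar name found among the keys of $arr, or false if none is found: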

function get_similare_key($arr, $name) {

    $names = preg_split("/\s+/", $name); 

    // look for a key of $arr whose words, except at most one, all appear in $name
    foreach( $arr as $key => $value ) {

        $key_names = preg_split("/\s+/", $key); 
        $diff = array_diff($key_names, $names); 
        if ( count($diff) <= 1 ) { 
            return $key;
        }

    }

    return false;

}

A working demo is available here.
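The answer does not show how the function is called, so here is a minimal sketch of one way it could drive the grouping. It assumes the group key is simply the first name seen for each group; the `$uniques` layout and the `strlen` comparison are my own choices, not part of the answer (`mb_strlen` would be the better fit for multibyte names).

```php
<?php
// get_similare_key() as given in the answer, repeated so the sketch runs on its own.
function get_similare_key($arr, $name) {
    $names = preg_split("/\s+/", $name);
    foreach ($arr as $key => $value) {
        $key_names = preg_split("/\s+/", $key);
        $diff = array_diff($key_names, $names);
        if (count($diff) <= 1) {
            return $key;
        }
    }
    return false;
}

$data = [
    ['name' => 'product 1 brandX',     'id_product' => '77777777'],
    ['name' => 'brandX product 1',     'id_product' => '77777777'],
    ['name' => 'brandX product 1 RED', 'id_product' => '77777777'],
    ['name' => 'product 1 brandX',     'id_product' => ''],
    ['name' => 'product 2 brandY',     'id_product' => '8888888'],
    ['name' => 'product 2 brandY RED', 'id_product' => ''],
];

$uniques = [];
foreach ($data as $item) {
    $key = get_similare_key($uniques, $item['name']);
    if ($key === false) {
        // Assumption: the first name seen for a group becomes that group's key.
        $uniques[$item['name']] = ['name' => $item['name'], 'count' => 1];
        continue;
    }
    $uniques[$key]['count']++;
    // Keep the shortest name seen so far as the display name.
    if (strlen($item['name']) < strlen($uniques[$key]['name'])) {
        $uniques[$key]['name'] = $item['name'];
    }
}

print_r(array_values($uniques));
```

For the sample data this produces two groups, with counts 4 and 2. Note that the comparison ignores id_product entirely and tolerates one differing word, so a key such as 'product 1 brandX' would also absorb a name like 'product 1 brandY'; whether that is acceptable depends on the data.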

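Answer 2 (score: 0)

My answer is based on two assumptions about how the products should be grouped:

  1. Although id_product may be missing, where it is present it is correct and sufficient to match two products; and

  2. To match two product names, the name with more words must contain all of the words of the name with fewer words.

Given these assumptions, here is a function that determines whether two individual products match (that is, whether they should be grouped together), plus a helper function that extracts the words from a name: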

    function productsMatch(array $product1, array $product2)
    {
        if (
            !empty($product1['id_product'])
            && !empty($product2['id_product'])
            && $product1['id_product'] === $product2['id_product']
        ) {
            // match based on id_product
            return true;
        }
        $words1 = getWordsFromProduct($product1);
        $words2 = getWordsFromProduct($product2);
        $min_word_count = min(count($words1), count($words2));
        $match_word_count = count(array_intersect_key($words1, $words2));
        if ($min_word_count >= 1 && $match_word_count === $min_word_count) {
            // match based on name similarity
            return true;
        }
        // no match
        return false;
    }
    
    function getWordsFromProduct(array $product)
    {
        $name = mb_strtolower($product['name']);
        preg_match_all('/\S+/', $name, $matches);
        $words = array_flip($matches[0]);
        return $words;
    }
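To make the name rule concrete, here is a small standalone sketch of the same idea applied to bare name strings. The helpers `wordSet` and `namesMatch` are hypothetical names introduced only for this illustration, and `strtolower` stands in for `mb_strtolower` on the assumption that the names are ASCII.

```php
<?php
// Hypothetical helper: the lower-cased words of a name, stored as array keys (a "set").
function wordSet($name)
{
    preg_match_all('/\S+/', strtolower($name), $matches);
    return array_flip($matches[0]);
}

// True when every word of the name with fewer words also occurs in the other name.
function namesMatch($name1, $name2)
{
    $words1 = wordSet($name1);
    $words2 = wordSet($name2);
    $minCount = min(count($words1), count($words2));
    // array_intersect_key keeps only the words present in both sets.
    $shared = count(array_intersect_key($words1, $words2));
    return $minCount >= 1 && $shared === $minCount;
}

var_dump(namesMatch('brandX product 1 RED', 'product 1 brandX')); // bool(true)
var_dump(namesMatch('product 1 brandX', 'product 2 brandY'));     // bool(false)
var_dump(namesMatch('product', 'product 2 brandY'));              // bool(true)
```

The last call shows a consequence of the rule worth keeping in mind: a very short name made of a generic word matches every longer name that contains that word.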
    

The match function can then be used to group the products:

    function groupProducts(array $data)
    {
        $groups = array();
        foreach ($data as $product1) {
            foreach ($groups as $key => $products) {
                foreach ($products as $product2) {
                    if (productsMatch($product1, $product2)) {
                        $groups[$key][] = $product1;
                        continue 3; // foreach ($data as $product1)
    
                    }
                }
            }
            $groups[] = array($product1);
        }
        return $groups;
    }
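One property of this grouping that may matter in practice: each product joins the first group it matches, and groups are never merged afterwards, so the result can depend on input order. The sketch below illustrates this with three made-up products (the sample values are hypothetical). It uses compact copies of the helpers, with `strtolower` in place of `mb_strtolower`, so that it runs on its own.

```php
<?php
// Compact copies of the answer's helpers (ASCII-only lower-casing assumed).
function getWordsFromProduct(array $product)
{
    preg_match_all('/\S+/', strtolower($product['name']), $matches);
    return array_flip($matches[0]);
}

function productsMatch(array $product1, array $product2)
{
    // Same ID rule as the answer: both IDs non-empty and identical.
    if (!empty($product1['id_product'])
        && $product1['id_product'] === $product2['id_product']) {
        return true;
    }
    $words1 = getWordsFromProduct($product1);
    $words2 = getWordsFromProduct($product2);
    $min = min(count($words1), count($words2));
    return $min >= 1 && count(array_intersect_key($words1, $words2)) === $min;
}

function groupProducts(array $data)
{
    $groups = array();
    foreach ($data as $product1) {
        foreach ($groups as $key => $products) {
            foreach ($products as $product2) {
                if (productsMatch($product1, $product2)) {
                    $groups[$key][] = $product1;
                    continue 3; // next product in $data
                }
            }
        }
        $groups[] = array($product1);
    }
    return $groups;
}

// Hypothetical products: $c shares its ID with $a and its name with $b.
$a = ['name' => 'alpha widget', 'id_product' => '1'];
$b = ['name' => 'beta gadget',  'id_product' => '2'];
$c = ['name' => 'beta gadget',  'id_product' => '1'];

$sizesAbc = array_map('count', groupProducts([$a, $b, $c]));
$sizesCab = array_map('count', groupProducts([$c, $a, $b]));
print_r($sizesAbc); // group sizes 2 and 1
print_r($sizesCab); // a single group of 3
```

If such bridging entries can occur in the data, a merge pass over the groups (or a union-find structure) would be one way to make the result independent of order; for the sample data in the question the issue does not arise.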
    

The following function can then be used to extract the shortest name and the count for each group:

    function uniqueProducts(array $groups)
    {
        $uniques = array();
        foreach ($groups as $products) {
            $shortest_name = '';
            $shortest_length = PHP_INT_MAX;
            $count = 0;
            foreach ($products as $product) {
                $length = mb_strlen($product['name']);
                if ($length < $shortest_length) {
                    $shortest_name = $product['name'];
                    $shortest_length = $length;
                }
                $count++;
            }
            $uniques[] = array(
                'name' => $shortest_name,
                'count' => $count,
            );
        }
        return $uniques;
    }
    

So, combining all 4 functions, you can obtain $uniques as follows (tested with PHP 5.6):

    $data[0]['name'] = 'product 1 brandX';
    $data[0]['id_product'] = '77777777';
    $data[1]['name'] = 'brandX product 1';
    $data[1]['id_product'] = '77777777';
    $data[2]['name'] = 'brandX product 1 RED';
    $data[2]['id_product'] = '77777777';
    $data[3]['name'] = 'product 1 brandX';
    $data[3]['id_product'] = '';
    $data[4]['name'] = 'product 2 brandY';
    $data[4]['id_product'] = '8888888';
    $data[5]['name'] = 'product 2 brandY RED';
    $data[5]['id_product'] = '';
    
    $groups = groupProducts($data);
    $uniques = uniqueProducts($groups);
    var_dump($uniques); 
    

This gives the output:

    array(2) {
      [0]=>
      array(2) {
        ["name"]=>
        string(16) "product 1 brandX"
        ["count"]=>
        int(4)
      }
      [1]=>
      array(2) {
        ["name"]=>
        string(16) "product 2 brandY"
        ["count"]=>
        int(2)
      }
    }