K-means聚类:怎么了? (PHP)

时间:2009-08-23 18:39:48

标签: php arrays cluster-analysis distance k-means

我一直在寻找一种计算足球经理比赛中动态市场价值的方法。 I asked this question here and got a very good answer from Alceu Costa.

我尝试编写此算法(90个元素,5个群集),但它无法正常工作:

  1. 在第一次迭代中,大部分元素会更改其群集。
  2. 从第二次迭代开始,所有元素都会更改其群集。
  3. 由于算法通常一直工作直到收敛(没有元素改变它的簇),所以在我的情况下它没有完成。
  4. 所以我手动将结束设置为第15次迭代。你可以看到它无限运行。
  5. You can see the output of my algorithm here.这有什么问题?你能告诉我为什么它不能正常工作吗?

    我希望你能帮助我。非常感谢你提前!

    以下是代码:

    <?php
    include 'zzserver.php';
    function distance($player1, $player2) {
        global $strengthMax, $maxStrengthMax, $motivationMax, $ageMax;
        // $playerX = array(strength, maxStrength, motivation, age, id);
        $distance = 0;
        $distance += abs($player1['strength']-$player2['strength'])/$strengthMax;
        $distance += abs($player1['maxStrength']-$player2['maxStrength'])/$maxStrengthMax;
        $distance += abs($player1['motivation']-$player2['motivation'])/$motivationMax;
        $distance += abs($player1['age']-$player2['age'])/$ageMax;
        return $distance;
    }
    function calculateCentroids() {
        global $cluster;
        $clusterCentroids = array();
        foreach ($cluster as $key=>$value) {
            $strenthValues = array();
            $maxStrenthValues = array();
            $motivationValues = array();
            $ageValues = array();
            foreach ($value as $clusterEntries) {
                $strenthValues[] = $clusterEntries['strength'];
                $maxStrenthValues[] = $clusterEntries['maxStrength'];
                $motivationValues[] = $clusterEntries['motivation'];
                $ageValues[] = $clusterEntries['age'];
            }
            if (count($strenthValues) == 0) { $strenthValues[] = 0; }
            if (count($maxStrenthValues) == 0) { $maxStrenthValues[] = 0; }
            if (count($motivationValues) == 0) { $motivationValues[] = 0; }
            if (count($ageValues) == 0) { $ageValues[] = 0; }
            $clusterCentroids[$key] = array('strength'=>array_sum($strenthValues)/count($strenthValues), 'maxStrength'=>array_sum($maxStrenthValues)/count($maxStrenthValues), 'motivation'=>array_sum($motivationValues)/count($motivationValues), 'age'=>array_sum($ageValues)/count($ageValues));
        }
        return $clusterCentroids;
    }
    function assignPlayersToNearestCluster() {
        global $cluster, $clusterCentroids;
        $playersWhoChangedClusters = 0;
        // BUILD NEW CLUSTER ARRAY WHICH ALL PLAYERS GO IN THEN START
        $alte_cluster = array_keys($cluster);
        $neuesClusterArray = array();
        foreach ($alte_cluster as $alte_cluster_entry) {
            $neuesClusterArray[$alte_cluster_entry] = array();
        }
        // BUILD NEW CLUSTER ARRAY WHICH ALL PLAYERS GO IN THEN END
        foreach ($cluster as $oldCluster=>$clusterValues) {
            // FOR EVERY SINGLE PLAYER START
            foreach ($clusterValues as $player) {
                // MEASURE DISTANCE TO ALL CENTROIDS START
                $abstaende = array();
                foreach ($clusterCentroids as $CentroidId=>$centroidValues) {
                    $distancePlayerCluster = distance($player, $centroidValues);
                    $abstaende[$CentroidId] = $distancePlayerCluster;
                }
                arsort($abstaende);
                if ($neuesCluster = each($abstaende)) {
                    $neuesClusterArray[$neuesCluster['key']][] = $player; // add to new array
                    // player $player['id'] goes to cluster $neuesCluster['key'] since it is the nearest one
                    if ($neuesCluster['key'] != $oldCluster) {
                        $playersWhoChangedClusters++;
                    }
                }
                // MEASURE DISTANCE TO ALL CENTROIDS END
            }
            // FOR EVERY SINGLE PLAYER END
        }
        $cluster = $neuesClusterArray;
        return $playersWhoChangedClusters;
    }
    // CREATE k CLUSTERS START
    $k = 5; // Anzahl Cluster
    $cluster = array();
    for ($i = 0; $i < $k; $i++) {
        $cluster[$i] = array();
    }
    // CREATE k CLUSTERS END
    // PUT PLAYERS IN RANDOM CLUSTERS START
    $sql1 = "SELECT ids, staerke, talent, trainingseifer, wiealt FROM ".$prefix."spieler LIMIT 0, 90";
    $sql2 = mysql_abfrage($sql1);
    $anzahlSpieler = mysql_num_rows($sql2);
    $anzahlSpielerProCluster = $anzahlSpieler/$k;
    $strengthMax = 0;
    $maxStrengthMax = 0;
    $motivationMax = 0;
    $ageMax = 0;
    $counter = 0; // for $anzahlSpielerProCluster so that all clusters get the same number of players
    while ($sql3 = mysql_fetch_assoc($sql2)) {
        $assignedCluster = floor($counter/$anzahlSpielerProCluster);
        $cluster[$assignedCluster][] = array('strength'=>$sql3['staerke'], 'maxStrength'=>$sql3['talent'], 'motivation'=>$sql3['trainingseifer'], 'age'=>$sql3['wiealt'], 'id'=>$sql3['ids']);
        if ($sql3['staerke'] > $strengthMax) { $strengthMax = $sql3['staerke']; }
        if ($sql3['talent'] > $maxStrengthMax) { $maxStrengthMax = $sql3['talent']; }
        if ($sql3['trainingseifer'] > $motivationMax) { $motivationMax = $sql3['trainingseifer']; }
        if ($sql3['wiealt'] > $ageMax) { $ageMax = $sql3['wiealt']; }
        $counter++;
    }
    // PUT PLAYERS IN RANDOM CLUSTERS END
    $m = 1;
    while ($m < 16) {
        $clusterCentroids = calculateCentroids(); // calculate new centroids of the clusters
        $playersWhoChangedClusters = assignPlayersToNearestCluster(); // assign each player to the nearest cluster
        if ($playersWhoChangedClusters == 0) { $m = 1001; }
        echo '<li>Iteration '.$m.': '.$playersWhoChangedClusters.' players have changed place</li>';
        $m++;
    }
    print_r($cluster);
    ?>
    

2 个答案:

答案 0 :(得分:2)

这太尴尬了:D我认为整个问题只是由一个字母引起的:

assignPlayersToNearestCluster()中,您可以找到 arsort($ abstaende); 。之后,函数 each()取第一个值。但它是 arsort 所以第一个值必须是最高的。因此它选择具有最高距离值的集群。

当然,它应该是 asort 。 :)为了证明这一点,我用 asort 测试了它 - 我在7次迭代后得到了收敛。 :)

你认为这是错误吗?如果是的话,我的问题就解决了。在那种情况下:抱歉用这个愚蠢的问题来烦你。 ;)

答案 1 :(得分:0)

编辑:无视,我仍然得到与你相同的结果,每个人都在群集4中结束。我将重新考虑我的代码并再试一次。

我想我已经意识到问题是什么,k-means聚类旨在打破集合中的差异,但是,由于你计算平均值等的方式,我们得到的情况是没有大的差距在范围内。

我可能会建议改变,只关注单个值(强度似乎对我来说最有意义)来确定聚类,或完全放弃这种排序方法,并采用不同的东西(不是你想听到的我知道的) )?

我找到了一个相当不错的网站,其中有一个使用整数的k-mean排序的例子,我将尝试编辑它,我将在明天的某个时间回来查看结果。

http://code.blip.pt/2009/04/06/k-means-clustering-in-php/&lt; - 我提到并忘了链接。