Question

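In my data mining project I am given a large, complex multidimensional array that contains all the information I need, except that I have to "fix" it before I can process it. I have written some code to deal with this, but for the amount of data I need to "fix" it takes far too long, and I am hoping someone can help me find a more efficient solution.

Essentially, the array I am working with is first indexed by integers, like any run-of-the-mill array, i.e. $x[0], $x[1], $x[2], except that each element is an associative array containing the key/value pairs I need (e.g. $x[0]['item'], $x[0]['price']). One key, however, is stored a level deeper: the ID.

The ID number exists in the array as $x[0]['@attributes']['id'], and I would like to simplify the structure by copying this value up alongside the other key/value pairs, e.g. as $x[0]['id'].

The data set I am working with is large, but here is a simplified example of my situation: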

$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$items = array($item1, $item2, $item3);
echo "Starting data - items using itemid as attribute:\n";
print_r($items);

# set item numbers by key instead of attribute
$i=0;
while(isset($items[$i]['@attributes']['id'])) {
   $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
   #unset($items[$i]['@attributes']);
   $i++;
} # while
echo "\nDesired result - items using itemid as key:\n";
print_r($items);

Here is the output of the example above:

Starting data - items using itemid as attribute:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
        )

)

Desired result - items using itemid as key:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
            [itemid] => 101
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
            [itemid] => 102
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
            [itemid] => 103
        )

)

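Note the [itemid] key/value pair added in the desired result. Is there a faster or more elegant way to accomplish this? I have looked at some of PHP's fancy array functions, but I can't wrap my head around how to apply them to this more complicated situation. Any ideas?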

Answer 1

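Memory efficiency

From the comments in the PHP documentation: the memory footprint of SplFixedArray is about 37% of that of a regular "array" of the same size.

SplFixedArray also implements Iterator (IteratorAggregate as of PHP 8), which means it encapsulates the list and exposes one element at a time, making iteration more efficient.

A by-value foreach loop works on a copy of the array passed to it (on PHP 7 and later the copy is deferred until the array is written to). If you are processing a large amount of data, using it directly on your array can be a performance issue.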

Also see How big are PHP arrays (and values) really? (Hint: BIG!)
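To make the copy behaviour concrete, here is a minimal sketch, using made-up two-row data shaped like the question's example, of the difference between iterating by value and by reference. It only illustrates the semantics and is not a performance claim.

```php
<?php
// Illustrative data only; the keys mirror the question's example.
$items = array(
    array('@attributes' => array('id' => '101'), 'item' => 'milk'),
    array('@attributes' => array('id' => '102'), 'item' => 'butter'),
);

// By value: $row is a copy of each element, so this change is lost.
foreach ($items as $row) {
    $row['itemid'] = $row['@attributes']['id'];
}
$lostChange = !isset($items[0]['itemid']);

// By reference: $row aliases the element, so the array is updated in place.
foreach ($items as &$row) {
    $row['itemid'] = $row['@attributes']['id'];
}
unset($row); // break the reference so later code cannot overwrite the last element

var_dump($lostChange);
var_dump($items[1]['itemid']);
```

The unset($row) after the by-reference loop matters: without it, a later loop that reuses $row would write into the last element.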

You can try

$it = SplFixedArray::fromArray($items);
foreach ( $it as $value ) {
    // Play with big array
}
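If you want to go further than just iterating, the following is a rough sketch of applying the question's fix directly on the SplFixedArray. It assumes the rows look like the simplified example. Each row is read, changed, and written back, because nested writes through the SplFixedArray index do not take effect.

```php
<?php
$items = array(
    array('@attributes' => array('id' => '101'), 'item' => 'milk', 'price' => '3.50'),
    array('@attributes' => array('id' => '102'), 'item' => 'butter', 'price' => '2.45'),
);

$fixed = SplFixedArray::fromArray($items);

// Writing $fixed[$i]['itemid'] = ... directly does not take effect,
// so each row is copied out, modified, and stored back.
for ($i = 0, $n = $fixed->getSize(); $i < $n; $i++) {
    $row = $fixed[$i];
    if (isset($row['@attributes']['id'])) {
        $row['itemid'] = $row['@attributes']['id'];
        unset($row['@attributes']);
    }
    $fixed[$i] = $row;
}

$result = $fixed->toArray();
print_r($result);
```

Whether this beats a plain array depends on your PHP version and data, so it is worth measuring before committing to it.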

Speed

Here is a simple benchmark

set_time_limit(0);

$total = 10000;
$item = array("milk","butter","bread");
$items = array();

// Generating Random Data
for($i = 0; $i < $total; $i ++) {
    $att = array('id' => $i);
    $items[] = array('@attributes' => $att,'item' => $item[$i % 3],'price' => mt_rand(100, 5000) / 100);
}
// Pure array no copy
function m1($array) {
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $array[$k]['id'] = $v['@attributes']['id'];
        unset($array[$k]['@attributes']);
    }
    return $array;
}

// Array clean copy
function m2($array) {
    $items = array();
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Iterator
function m3($array) {
    $it = new ArrayIterator($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// SplFixedArray Array
function m4($array) {
    $it = SplFixedArray::fromArray($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Map
function m5($array) {
    $items = array_map(function ($v) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    }, $array);
    return $items;
}

// Array Walk
function m6($array) {
    array_walk($array, function (&$v, $k) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        // No return needed: array_walk ignores it, $v is changed by reference
    });
    return $array;
}

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 1; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Single Run\n";
print_r($result);
echo '</pre>';

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 2; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Dual Run\n";
print_r($result);
echo '</pre>';

The results are quite interesting:

PHP 5.3.10

Single Run
Array
(
    [m1] => 0.029280185699463 <--------------- fastest
    [m2] => 0.038463115692139
    [m3] => 0.049274921417236
    [m4] => 0.03856086730957
    [m5] => 0.032699823379517
    [m6] => 0.032186985015869
)

Dual Run
Array
(
    [m1] => 0.068470001220703
    [m2] => 0.077174663543701
    [m3] => 0.085768938064575
    [m4] => 0.07695198059082
    [m5] => 0.073209047317505
    [m6] => 0.065080165863037 <--------------- fastest after 2 loops
)

PHP 5.4.1

Single Run
Array
(
    [m1] => 0.029529094696045
    [m2] => 0.035377979278564
    [m3] => 0.03830099105835
    [m4] => 0.034613132476807
    [m5] => 0.031363010406494
    [m6] => 0.028403043746948  <---------- fastest
)

Dual Run
Array
(
    [m1] => 0.072367191314697
    [m2] => 0.071731090545654
    [m3] => 0.078131914138794
    [m4] => 0.075049877166748
    [m5] => 0.065959930419922
    [m6] => 0.060923099517822  <---------- Fastest
)
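In the numbers above, m1 and m6 trade places between the single and dual runs on PHP 5.3.10, and the gaps are only a few milliseconds, so some of this is probably noise. One way to reduce it, sketched here with a hypothetical bestTime() helper that is not part of the benchmark above, is to repeat each candidate and keep the fastest run. The data set and the candidate closure are stand-ins.

```php
<?php
/**
 * Hypothetical helper (not part of the benchmark above): runs $fn on $data
 * $runs times and returns the fastest wall-clock time in seconds.
 */
function bestTime(callable $fn, array $data, $runs = 5)
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fn($data);
        $best = min($best, microtime(true) - $start);
    }
    return $best;
}

// Small stand-in data set shaped like the benchmark's rows.
$items = array();
for ($i = 0; $i < 1000; $i++) {
    $items[] = array('@attributes' => array('id' => $i), 'item' => 'milk', 'price' => 1.0);
}

// Candidate under test: by-reference foreach on a local copy of the rows.
$candidate = function (array $rows) {
    foreach ($rows as &$row) {
        $row['id'] = $row['@attributes']['id'];
        unset($row['@attributes']);
    }
    unset($row);
    return $rows;
};

$seconds = bestTime($candidate, $items, 5);
printf("best of 5: %.6f s\n", $seconds);
```

Taking the minimum rather than the sum is a design choice: it discards runs that were slowed down by unrelated activity on the machine.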

Answer 2

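It looks like this comes from XML, so I'd add that @attributes may contain more than just the ID. Assuming that won't happen, you could try using foreach instead, although I'm not sure about the speed gain.

There may be an impact because you are modifying the same array you are looping over (I couldn't find evidence for this, so it needs experimenting).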

$cleanedArray = array();
foreach ($bigArray as $subArray)
{
  if (isset($subArray['@attributes']))
  {
    $subArray['itemid'] = $subArray['@attributes']['id'];
    unset($subArray['@attributes']); // Optional
    $cleanedArray[] = $subArray;
  }
}
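If @attributes does turn out to carry more than the ID, as cautioned at the top of this answer, a variation along these lines could lift every attribute up one level. This is only a sketch: the 'sku' attribute is invented for illustration, the ID keeps its original key 'id' rather than 'itemid', and unlike the loop above it also keeps rows that have no @attributes.

```php
<?php
// Hypothetical input: 'sku' is an invented second attribute for illustration.
$bigArray = array(
    array('@attributes' => array('id' => '101', 'sku' => 'A1'), 'item' => 'milk', 'price' => '3.50'),
    array('item' => 'butter', 'price' => '2.45'), // no @attributes at all
);

$cleanedArray = array();
foreach ($bigArray as $subArray) {
    if (isset($subArray['@attributes'])) {
        // The + operator keeps existing keys on the left, so a real 'item' or
        // 'price' value is never overwritten by a same-named attribute.
        $subArray = $subArray + $subArray['@attributes'];
        unset($subArray['@attributes']);
    }
    $cleanedArray[] = $subArray;
}

print_r($cleanedArray);
```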

Apologies if it ends up being slower.

Edit: added the missing index.

Answer 3

This is not so much an answer as a comparison of the methods provided:

I used this script to average the time taken by each algorithm:

<?php
//base data
$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$results = array('test1'=>array(),'test2'=>array(),'test3'=>array());

//set trials
$trials=1000;

//test 1: index-based while loop from the question
for($count=0;$count<$trials;$count++){
  unset($items);
  $items = array($item1, $item2, $item3);
  // microtime(true) returns a float; the default "msec sec" string cannot be subtracted reliably
  $timer1=microtime(true);
  $i=0;
  while(isset($items[$i]['@attributes']['id'])) {
    $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
    $i++;
  }
  $timer1=microtime(true)-$timer1;
  $results['test1'][$count]=$timer1;
}

//test 2: foreach into a cleaned copy
for($count=0;$count<$trials;$count++){
  unset($items);
  unset($cleanedArray);
  $items = array($item1, $item2, $item3);
  $cleanedArray = array();
  $timer2=microtime(true);
  foreach($items as $subArray)
  {
    if(isset($subArray['@attributes']))
    {
      // copy the id as test 1 does, so the tests do the same work
      $subArray['itemid'] = $subArray['@attributes']['id'];
      unset($subArray['@attributes']);
      $cleanedArray[] = $subArray;
    }
  }
  $timer2=microtime(true)-$timer2;
  $results['test2'][$count]=$timer2;
}

//test 3: same loop over an SplFixedArray
for($count=0;$count<$trials;$count++){
  unset($items);
  unset($it);
  $items = array($item1, $item2, $item3);
  $it = SplFixedArray::fromArray($items);
  $cleanedArray = array(); // reset, otherwise it keeps growing across trials
  $timer3=microtime(true);
  foreach($it as $subArray)
  {
    if(isset($subArray['@attributes']))
    {
      $subArray['itemid'] = $subArray['@attributes']['id'];
      unset($subArray['@attributes']);
      $cleanedArray[] = $subArray;
    }
  }
  $timer3=microtime(true)-$timer3;
  $results['test3'][$count]=$timer3;
}

//results
$factor=pow(10,-6);
echo "Test 1 averaged " . round(array_sum($results['test1']) / count($results['test1'])/$factor,1) . " µs, with range: " . round((max($results['test1'])-min($results['test1']))/$factor,1) . " µs - (min: " . (min($results['test1'])/$factor) . ", max: " . (max($results['test1'])/$factor) . ")<br/>";

echo "Test 2 averaged " . round(array_sum($results['test2']) / count($results['test2'])/$factor,1) . " µs, with range: " . round((max($results['test2'])-min($results['test2']))/$factor,1) . " µs - (min: " . (min($results['test2'])/$factor) . ", max: " . (max($results['test2'])/$factor) . ")<br/>";

echo "Test 3 averaged " . round(array_sum($results['test3']) / count($results['test3'])/$factor,1) . " µs, with range: " . round((max($results['test3'])-min($results['test3']))/$factor,1) . " µs - (min: " . (min($results['test3'])/$factor) . ", max: " . (max($results['test3'])/$factor) . ")<br/>";

echo "<pre>";
var_dump($results);
echo "</pre>";

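The results here vary a lot with a low number of trials, but should become more skewed if the base array is larger and the number of trials is higher.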
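Because the per-trial timings vary so much, a mean and a range alone can be dominated by one slow trial. As a rough sketch, a hypothetical summarise() helper (not part of the script above) could report the median next to the mean. The sample values below are invented to show the effect of a single outlier.

```php
<?php
/**
 * Hypothetical helper: summarises timing samples as mean, median, min and max.
 */
function summarise(array $samples)
{
    if (count($samples) === 0) {
        throw new InvalidArgumentException('need at least one sample');
    }
    sort($samples); // sorts the local copy only
    $n = count($samples);
    $mid = (int) floor($n / 2);
    // Odd count: middle value. Even count: average of the two middle values.
    $median = ($n % 2 === 1)
        ? $samples[$mid]
        : ($samples[$mid - 1] + $samples[$mid]) / 2;

    return array(
        'mean'   => array_sum($samples) / $n,
        'median' => $median,
        'min'    => $samples[0],
        'max'    => $samples[$n - 1],
    );
}

// Invented timings in microseconds; 100.0 plays the role of one slow trial.
$samples = array(3.0, 1.0, 2.0, 100.0);
$summary = summarise($samples);
print_r($summary);
```

Here the mean is 26.5 while the median is 2.5, which is closer to what a typical trial took.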
