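For my data-mining project I've been handed a huge, complex multidimensional array that holds all the information I need, except that I have to "fix" it before I can process it. I've written some code to do this, but it takes far too long for the volume of data I need to "fix", and I'm hoping someone can help me find a more efficient solution.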
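Essentially, the array I'm working with is first indexed by integer, like any run-of-the-mill array, i.e. $x[0], $x[1], $x[2], except that each element is an associative array containing the key/value pairs I need (e.g. $x[0]['item'], $x[0]['price']). One key, however, is stored a level deeper: the ID.
The ID number lives at $x[0]['@attributes']['id'], and I'd like to simplify the structure by copying it up alongside the other key/value pairs, e.g. $x[0]['id'].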
The data set I'm working with is large, but here is a simplified example of my situation:
$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$items = array($item1, $item2, $item3);
echo "Starting data - items using itemid as attribute:\n";
print_r($items);
# set item numbers by key instead of attribute
$i = 0;
while (isset($items[$i]['@attributes']['id'])) {
    $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
    #unset($items[$i]['@attributes']);
    $i++;
} # while
echo "\nDesired result - items using itemid as key:\n";
print_r($items);
Here is the output from the example above:
Starting data - items using itemid as attribute:
Array
(
[0] => Array
(
[@attributes] => Array
(
[id] => 101
)
[item] => milk
[price] => 3.50
)
[1] => Array
(
[@attributes] => Array
(
[id] => 102
)
[item] => butter
[price] => 2.45
)
[2] => Array
(
[@attributes] => Array
(
[id] => 103
)
[item] => bread
[price] => 1.19
)
)
Desired result - items using itemid as key:
Array
(
[0] => Array
(
[@attributes] => Array
(
[id] => 101
)
[item] => milk
[price] => 3.50
[itemid] => 101
)
[1] => Array
(
[@attributes] => Array
(
[id] => 102
)
[item] => butter
[price] => 2.45
[itemid] => 102
)
[2] => Array
(
[@attributes] => Array
(
[id] => 103
)
[item] => bread
[price] => 1.19
[itemid] => 103
)
)
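Note the [itemid] key/value pair added in the desired result. Is there a faster or more elegant way to accomplish this? I've looked at some of PHP's fancy array functions, but I can't wrap my head around this more complicated case well enough to use them. Any ideas?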
Answer 0 (score: 2)
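From the PHP doc comments: an SplFixedArray has a memory footprint of about 37% of a regular "array" of the same size.
SplFixedArray also implements Iterator, which means it encapsulates the list and exposes one element at a time, which can make iteration more efficient.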
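The memory claim is easy to check on your own installation. The following is only a rough sketch: measureBytes() is a hypothetical helper, the element count is an arbitrary assumption, and the ratio you observe depends heavily on the PHP version, so it may not match the 37% figure quoted above.

```php
<?php
// Hypothetical helper: returns the memory growth caused by $build() together
// with the built value (kept alive so the memory stays allocated).
function measureBytes($build) {
    $before = memory_get_usage();
    $data = $build();
    $after = memory_get_usage();
    return array($after - $before, $data);
}

$count = 100000; // assumed size, purely illustrative

// Regular PHP array holding the integers 1..$count.
list($arrayBytes, $plain) = measureBytes(function () use ($count) {
    return range(1, $count);
});

// SplFixedArray holding the same values.
list($fixedBytes, $fixed) = measureBytes(function () use ($count) {
    $f = new SplFixedArray($count);
    for ($i = 0; $i < $count; $i++) {
        $f[$i] = $i + 1;
    }
    return $f;
});

printf("array:         %d bytes\n", $arrayBytes);
printf("SplFixedArray: %d bytes\n", $fixedBytes);
```

Note that this measures a flat list of integers; the nested associative arrays in the question are stored the same way inside either container, so the saving applies only to the outer list.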
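A by-value foreach loop works on a copy of whatever array is passed to it. If you are processing a large amount of data, using it directly on your array can become a performance issue.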
Also see How big are PHP arrays (and values) really? (Hint: BIG!)
You can try:
$it = SplFixedArray::fromArray($items);
foreach ( $it as $value ) {
    // Play with big array
}
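For the specific task in the question (adding itemid to every element), another option worth measuring is a by-reference foreach, which modifies each element in place. This is a minimal sketch using sample data shaped like the question's example, not a measured recommendation.

```php
<?php
// Sample data shaped like the question's example.
$items = array(
    array('@attributes' => array('id' => '101'), 'item' => 'milk',   'price' => '3.50'),
    array('@attributes' => array('id' => '102'), 'item' => 'butter', 'price' => '2.45'),
);

// Iterate by reference so each element is modified in place.
foreach ($items as &$item) {
    if (isset($item['@attributes']['id'])) {
        $item['itemid'] = $item['@attributes']['id'];
    }
}
unset($item); // break the reference so later code cannot overwrite the last element

print_r($items);
```

The unset() after the loop matters: without it, $item remains a reference to the last element, and a later assignment to $item would silently change the array.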
Here is a simple benchmark:
set_time_limit(0);
echo "<pre>";
$total = 10000;
$item = array("milk","butter","bread");
$items = array();
// Generating Random Data
for($i = 0; $i < $total; $i ++) {
    $att = array('id' => $i);
    $items[] = array('@attributes' => $att,'item' => $item[$i % 3],'price' => mt_rand(100, 5000) / 100);
}
// Pure array no copy
function m1($array) {
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $array[$k]['id'] = $v['@attributes']['id'];
        unset($array[$k]['@attributes']);
    }
    return $array;
}
// Array clean copy
function m2($array) {
    $items = array();
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}
// Array Iterator
function m3($array) {
    $it = new ArrayIterator($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}
// SplFixedArray Array
function m4($array) {
    $it = SplFixedArray::fromArray($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}
// Array Map
function m5($array) {
    $items = array_map(function ($v) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    }, $array);
    return $items;
}
// Array Walk
function m6($array) {
    array_walk($array, function (&$v, $k) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    });
    return $array;
}
$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);
for($i = 0; $i < 1; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}
echo '<pre>';
echo "Single Run\n";
print_r($result);
echo '</pre>';
$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);
for($i = 0; $i < 2; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}
echo '<pre>';
echo "Dual Run\n";
print_r($result);
echo '</pre>';
The results are quite interesting:
PHP 5.3.10
Single Run
Array
(
[m1] => 0.029280185699463 <--------------- fastest
[m2] => 0.038463115692139
[m3] => 0.049274921417236
[m4] => 0.03856086730957
[m5] => 0.032699823379517
[m6] => 0.032186985015869
)
Dual Run
Array
(
[m1] => 0.068470001220703
[m2] => 0.077174663543701
[m3] => 0.085768938064575
[m4] => 0.07695198059082
[m5] => 0.073209047317505
[m6] => 0.065080165863037 <--------------- Fastest after 2 loops
)
PHP 5.4.1
Single Run
Array
(
[m1] => 0.029529094696045
[m2] => 0.035377979278564
[m3] => 0.03830099105835
[m4] => 0.034613132476807
[m5] => 0.031363010406494
[m6] => 0.028403043746948 <---------- fastest
)
Dual Run
Array
(
[m1] => 0.072367191314697
[m2] => 0.071731090545654
[m3] => 0.078131914138794
[m4] => 0.075049877166748
[m5] => 0.065959930419922
[m6] => 0.060923099517822 <---------- Fastest
)
Answer 1 (score: 1)
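It looks like this came from XML, so I'd add that @attributes may contain more than just the ID. Assuming that won't happen, you could try using foreach instead, though I'm not sure about the speed gain.
There may be an impact because you are modifying the same array you are looping over (I couldn't find evidence for this, so it needs experimenting).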
$cleanedArray = array();
foreach($bigArray as $subArray)
{
    if(isset($subArray['@attributes']))
    {
        $subArray['itemid'] = $subArray['@attributes']['id'];
        unset($subArray['@attributes']); //Optional
        $cleanedArray[] = $subArray;
    }
}
Apologies if this ends up being slower.
Edit: added the missing index.
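If @attributes does turn out to hold more than the ID, one possible variation is to lift every attribute up a level instead of copying only id. The sketch below assumes attribute names should never overwrite regular fields; the 'sku' attribute and the second sample element are made up for illustration. Attributes keep their own names here, so the ID ends up under 'id' rather than 'itemid'.

```php
<?php
// Sample data; 'sku' is a hypothetical extra attribute.
$bigArray = array(
    array('@attributes' => array('id' => '101', 'sku' => 'A-1'), 'item' => 'milk', 'price' => '3.50'),
    array('item' => 'orphan', 'price' => '0.00'), // no @attributes: skipped, as in the loop above
);

$cleanedArray = array();
foreach ($bigArray as $subArray) {
    if (isset($subArray['@attributes']) && is_array($subArray['@attributes'])) {
        $attributes = $subArray['@attributes'];
        unset($subArray['@attributes']);
        // The + operator keeps keys already present in $subArray, so an
        // attribute never overwrites a regular field with the same name.
        $cleanedArray[] = $subArray + $attributes;
    }
}

print_r($cleanedArray);
```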
Answer 2 (score: 0)
This isn't so much an answer as a comparison of the methods provided.
I used this script to average the time each algorithm takes:
<?php
//base data
$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$results = array('test1'=>array(),'test2'=>array(),'test3'=>array());
//set trials
$trials=1000;
//test 1: while loop from the question
for($count=0;$count<$trials;$count++){
    unset($items);
    $items = array($item1, $item2, $item3);
    // microtime(true) returns float seconds; the default string form cannot be subtracted reliably
    $timer1=microtime(true);
    $i=0;
    while(isset($items[$i]['@attributes']['id'])) {
        $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
        $i++;
    }
    $timer1=microtime(true)-$timer1;
    $results['test1'][$count]=$timer1;
}
//test 2: foreach into a cleaned copy
for($count=0;$count<$trials;$count++){
    unset($items);
    unset($cleanedArray);
    $items = array($item1, $item2, $item3);
    $cleanedArray = array();
    $timer2=microtime(true);
    foreach($items as $subArray)
    {
        if(isset($subArray['@attributes']))
        {
            // copy the id so this test does the same work as test 1
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer2=microtime(true)-$timer2;
    $results['test2'][$count]=$timer2;
}
//test 3: foreach over an SplFixedArray
for($count=0;$count<$trials;$count++){
    unset($items);
    unset($it);
    unset($cleanedArray);
    $items = array($item1, $item2, $item3);
    $it = SplFixedArray::fromArray($items);
    // reset the output so it does not keep growing across trials
    $cleanedArray = array();
    $timer3=microtime(true);
    foreach($it as $subArray)
    {
        if(isset($subArray['@attributes']))
        {
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer3=microtime(true)-$timer3;
    $results['test3'][$count]=$timer3;
}
//results
$factor=pow(10,-6);
echo "Test 1 averaged " . round(array_sum($results['test1']) / count($results['test1'])/$factor,1) . " µs, with range: " . round((max($results['test1'])-min($results['test1']))/$factor,1) . " µs - (min: " . (min($results['test1'])/$factor) . ", max: " . (max($results['test1'])/$factor) . ")<br/>";
echo "Test 2 averaged " . round(array_sum($results['test2']) / count($results['test2'])/$factor,1) . " µs, with range: " . round((max($results['test2'])-min($results['test2']))/$factor,1) . " µs - (min: " . (min($results['test2'])/$factor) . ", max: " . (max($results['test2'])/$factor) . ")<br/>";
echo "Test 3 averaged " . round(array_sum($results['test3']) / count($results['test3'])/$factor,1) . " µs, with range: " . round((max($results['test3'])-min($results['test3']))/$factor,1) . " µs - (min: " . (min($results['test3'])/$factor) . ", max: " . (max($results['test3'])/$factor) . ")<br/>";
echo "<pre>";
var_dump($results);
echo "</pre>";
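The results here vary widely at a low number of trials, but they should become more skewed with a larger base array and a higher number of trials.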
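To try the comparison on a larger base array, something along these lines could be used. It is only a sketch: averageMicroseconds() is a hypothetical helper, the array size and trial count are assumptions, and only the while loop from the question is wired in; the other methods can be passed in the same way as closures.

```php
<?php
// Hypothetical helper: average wall-clock time of $fn($input), in microseconds.
function averageMicroseconds($fn, array $input, $trials) {
    $total = 0.0;
    for ($t = 0; $t < $trials; $t++) {
        $start = microtime(true);   // true => float seconds
        $fn($input);                // $input is passed by value, so each trial starts unmodified
        $total += microtime(true) - $start;
    }
    return $total / $trials * 1e6;
}

// Build a larger base array (10000 elements is an arbitrary assumption).
$items = array();
for ($i = 0; $i < 10000; $i++) {
    $items[] = array('@attributes' => array('id' => (string) $i), 'item' => 'milk', 'price' => '3.50');
}

// The while loop from the question, wrapped in a closure.
$whileLoop = function (array $items) {
    $i = 0;
    while (isset($items[$i]['@attributes']['id'])) {
        $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
        $i++;
    }
    return $items;
};

printf("while loop: %.1f µs per run\n", averageMicroseconds($whileLoop, $items, 20));
```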