我正在编写一个爬虫来废弃在线商店的价格信息,我使用的是Goutte PHP并且它不支持javascript交互,它只是抓取静态HTML DOM,所以问题出在我的爬虫价格,库存信息的回复中隐藏在脚本标记内的HTML DOM中,该标记进一步用JSON封装,这是我得到的数据
<script>
HZ.productVariation.Manager.setSpaceId('33503761');
HZ.data.Variations.put('33503761', {"availVar": [{"id": "c", "label": "Color", "options": [{"name": "Chrome", "avail": 1, "stock": 1, "price": "$174.51", "quantity": "52", "imageId": "3eb1230d05775d3c"}, {"name": "SuperSteel", "avail": 1, "stock": 0, "price": "$341.40", "quantity": "0", "imageId": "d0a126f505775d3e"}]}], "curVar": {"c": "SuperSteel"}, "exactMatch": true});
HZ.productVariation.Manager.setSelector(HZ.productVariation.ListSelector);
HZ.productVariation.Manager.setRenderer(HZ.productVariation.ViewSpaceRenderer);
HZ.productVariation.Manager.setHistoryManager(new HZ.productVariation.BrowserHistoryManager("replace"));
$('.variationSelectors').append(HZ.productVariation.Manager.drawSelectors('33503761'));
HZ.productVariation.Manager.initUI();
</script>
我想创建一个正则表达式,我给它键,它将返回值,让我们说我想获得“股票”的价值所以当我在正则表达式中插入股票密钥时,它将返回与密钥库存相关的所有值作为PHP数组,当我插入具有对象数组作为值的键时,应该返回嵌套数组。
这是编码尝试:
$re = '/{"availVar":(.*)}/';
preg_match_all($re, $string, $output_array, PREG_SET_ORDER, 0);
$json = json_decode($output_array[0][0], true);
目前我正在遍历这个已解码的json数组并获取库存状态,我想要做的只是编写一个接收密钥的函数,应用正则表达式并返回此值作为结果。
无论如何都要创建这种类型的正则表达式?请指教,谢谢
答案 0 :(得分:0)
解析Json并使用json解码 我假设33503761是某种不应该出现在Json中的id。
$pos1 =strpos($html, ".put(")+17; // find "put(" and some id? And skip it
$json = substr($html, $pos1, strpos($html, ");", $pos1)-$pos1); //parse json
$arr =json_decode($json,true);
Var_dump($arr);
要查找库存,您可以使用array_walk_recursive。
function test_print($item, $key, $find)
{
if($key == $find)
echo "$key holds $item\n";
}
array_walk_recursive($arr, 'test_print', "stock");
答案 1 :(得分:0)
代码:(Demo)
function extractColumnData($html,$key){
if(!preg_match("~\.put\('\d+',\s+(.+?)\);~",$html,$out)){
return "alert yourself of the preg_match failure";
}
if(($array=@json_decode($out[1],true))===null && json_last_error()!==JSON_ERROR_NONE){
return "alert yourself of the json decode failure";
}
return array_column(current(array_column(current($array),'options')),$key,'name'); // this assumes the static structure of your json/array data
}
$html=<<<HTML
<script>
HZ.productVariation.Manager.setSpaceId('33503761');
HZ.data.Variations.put('33503761', {"availVar": [{"id": "c", "label": "Color", "options": [{"name": "Chrome", "avail": 1, "stock": 1, "price": "$174.51", "quantity": "52", "imageId": "3eb1230d05775d3c"}, {"name": "SuperSteel", "avail": 1, "stock": 0, "price": "$341.40", "quantity": "0", "imageId": "d0a126f505775d3e"}]}], "curVar": {"c": "SuperSteel"}, "exactMatch": true});
HZ.productVariation.Manager.setSelector(HZ.productVariation.ListSelector);
HZ.productVariation.Manager.setRenderer(HZ.productVariation.ViewSpaceRenderer);
HZ.productVariation.Manager.setHistoryManager(new HZ.productVariation.BrowserHistoryManager("replace"));
$('.variationSelectors').append(HZ.productVariation.Manager.drawSelectors('33503761'));
HZ.productVariation.Manager.initUI();
</script>
HTML;
echo "avail => ";
var_export(extractColumnData($html,'avail'));
echo "\n----\n";
echo "stock => ";
var_export(extractColumnData($html,'stock'));
echo "\n----\n";
echo "price => ";
var_export(extractColumnData($html,'price'));
echo "\n----\n";
echo "quantity => ";
var_export(extractColumnData($html,'quantity'));
echo "\n----\n";
echo "imageId => ";
var_export(extractColumnData($html,'imageId'));
输出:
avail => array (
'Chrome' => 1,
'SuperSteel' => 1,
)
----
stock => array (
'Chrome' => 1,
'SuperSteel' => 0,
)
----
price => array (
'Chrome' => '$174.51',
'SuperSteel' => '$341.40',
)
----
quantity => array (
'Chrome' => '52',
'SuperSteel' => '0',
)
----
imageId => array (
'Chrome' => '3eb1230d05775d3c',
'SuperSteel' => 'd0a126f505775d3e',
)