I fetch a page's HTML source with phpQuery, then extract the following string from a script tag in the head using a PHP regex:
var BASE_DATA = {
    userInfo: {
        id: 0,
        userName: 'no-needed',
        avatarUrl: 'no-needed',
        isPgc: false,
        isOwner: false
    },
    headerInfo: {
        id: 0,
        isPgc: false,
        userName: 'no-needed',
        avatarUrl: 'no-needed',
        isHomePage: false,
        crumbTag: 'no-needed',
        hasBar: true
    },
    articleInfo:
    {
        title: 'needed',
        content: 'needed',
        groupId: 'needed',
        itemId: 'needed',
        type: 1,
        subInfo: {
            isOriginal: false,
            source: 'needed',
            time: 'needed'
        },
        tagInfo: {
            tags: [{"name":"no-needed 1"},{"name":"no-needed 2"},{"name":"no-needed 3"}],
            groupId: 'no-needed',
            itemId: 'no-needed',
            repin: 0,
        },
        has_extern_link: 0,
        coverImg: 'no-needed'
    },
    commentInfo:
    {
        groupId: 'no-needed',
        itemId: 'no-needed',
        comments_count: 151,
        ban_comment: 0
    },};
I want to convert this string to a PHP array, like:
$base_data = array(
    'articleInfo' => array(
        'title' => 'needed',
        'content' => 'needed',
        'groupId' => 'needed',
        'itemId' => 'needed',
        'subInfo' => array(
            'source' => 'needed',
            'time' => 'needed',
        ),
    ));
or
$base_data = array(
    'title' => 'needed',
    'content' => 'needed',
    'groupId' => 'needed',
    'itemId' => 'needed',
    'subInfo' => array(
        'source' => 'needed',
        'time' => 'needed',
    ),);
I have already tried several approaches, such as json_decode and extracting the content between the braces with a PHP regex and preg_match_all, but none of them worked well.
I tried two ways.
The first way:
$json = str_ireplace(array('var BASE_DATA =', '};'), array('', '}'), $js);
json_decode($json, true);
The second way:
preg_match_all('/\{([^}]+)\}/', $js, $matches);
print_r($matches[1]);
or
preg_match_all('/articleInfo:\s*\{([^}]+)\}/', $script_text, $matches);
print_r($matches[1][0]);
This seems close to working, but the result is still not good: I would then have to parse the string inside the articleInfo part myself, which is why I am posting this question.
I even considered using the V8 JavaScript engine, but.....
Does anyone know a better way to do this?
Answer 0 (score: 1)
I had to reformat your JSON, which was not valid (checked on https://jsonlint.com/).
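The same check can be made from PHP itself. The snippet below is only a minimal sketch on a shortened, made-up sample in the style of BASE_DATA: json_decode() returns NULL for input with unquoted keys, single-quoted strings, or trailing commas, and json_last_error_msg() reports the problem.

```php
<?php
// Hypothetical, shortened sample in the same style as BASE_DATA.
$js_style = "{ title: 'needed', repin: 0, }";

// Unquoted keys, single quotes and the trailing comma are not valid JSON,
// so decoding fails and json_decode() returns NULL.
$failed = json_decode($js_style, true);
var_dump($failed);
echo json_last_error_msg(), "\n";

// The same data written as strict JSON decodes without trouble.
$strict = '{ "title": "needed", "repin": 0 }';
$ok = json_decode($strict, true);
print_r($ok);
```

This is why the string has to be cleaned up before json_decode() can do anything with it.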
I deliberately used multiple str_replace() calls in the conversion script below so that the process is easier to follow; you can optimize it by passing several replacements at once to a single str_replace() call.
This works:
<?php
$to_decode = "var BASE_DATA = {
userInfo: {
id: 0,
userName: 'no-needed',
avatarUrl: 'no-needed',
isPgc: false,
isOwner: false
},
headerInfo: {
id: 0,
isPgc: false,
userName: 'no-needed',
avatarUrl: 'no-needed',
isHomePage: false,
crumbTag: 'no-needed',
hasBar: true
},
articleInfo:
{
title: 'needed',
content: 'needed',
groupId: 'needed',
itemId: 'needed',
type: 1,
subInfo: {
isOriginal: false,
source: 'needed',
time: 'needed'
},
tagInfo: {
tags: [{\"name\":\"no-needed 1\"},{\"name\":\"no-needed 2\"},{\"name\":\"no-needed 3\"}],
groupId: 'no-needed',
itemId: 'no-needed',
repin: 0,
},
has_extern_link: 0,
coverImg: 'no-needed'
},
commentInfo:
{
groupId: 'no-needed',
itemId: 'no-needed',
comments_count: 151,
ban_comment: 0
},};";
/* Clean JSON and encapsulate in brackets */
$to_decode = str_replace('var BASE_DATA = {', '', $to_decode);
$to_decode = '{'.substr($to_decode, 0, -3).'}';
/* Remove spaces, tabs, new lines, etc. */
$to_decode = str_replace(' ', '', $to_decode);
$to_decode = str_replace("\n", '', $to_decode);
$to_decode = str_replace("\t", '', $to_decode);
$to_decode = str_replace("\r", '', $to_decode);
/* Encapsulate keys with quotes */
$to_decode = preg_replace('/([a-z_]+)\:/ui', '"{$1}":', $to_decode);
$to_decode = str_replace('"{', '"', $to_decode);
$to_decode = str_replace('}"', '"', $to_decode);
$to_decode = str_replace('\'', '"', $to_decode);
/* Remove unnecessary trailing commas */
$to_decode = str_replace(',}', '}', $to_decode);
echo '<pre>';
var_dump(json_decode($to_decode));
Result using print_r():
(I added true/false for clarity; otherwise these only show when using var_dump())
stdClass Object
(
    [userInfo] => stdClass Object
        (
            [id] => 0
            [userName] => no-needed
            [avatarUrl] => no-needed
            [isPgc] => false
            [isOwner] => false
        )
    [headerInfo] => stdClass Object
        (
            [id] => 0
            [isPgc] => false
            [userName] => no-needed
            [avatarUrl] => no-needed
            [isHomePage] => false
            [crumbTag] => no-needed
            [hasBar] => true
        )
    [articleInfo] => stdClass Object
        (
            [title] => needed
            [content] => needed
            [groupId] => needed
            [itemId] => needed
            [type] => 1
            [subInfo] => stdClass Object
                (
                    [isOriginal] => false
                    [source] => needed
                    [time] => needed
                )
            [tagInfo] => stdClass Object
                (
                    [tags] => Array
                        (
                            [0] => stdClass Object
                                (
                                    [name] => no-needed1
                                )
                            [1] => stdClass Object
                                (
                                    [name] => no-needed2
                                )
                            [2] => stdClass Object
                                (
                                    [name] => no-needed3
                                )
                        )
                    [groupId] => no-needed
                    [itemId] => no-needed
                    [repin] => 0
                )
            [has_extern_link] => 0
            [coverImg] => no-needed
        )
    [commentInfo] => stdClass Object
        (
            [groupId] => no-needed
            [itemId] => no-needed
            [comments_count] => 151
            [ban_comment] => 0
        )
)
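The question asked for a PHP array containing only the "needed" fields rather than stdClass objects. As a rough sketch, passing true as the second argument of json_decode() gives associative arrays, and array_intersect_key() can keep just the wanted keys. The $to_decode value here is a shortened, hypothetical stand-in for the cleaned-up string, not the full output of the script above.

```php
<?php
// Hypothetical stand-in for the cleaned-up JSON string.
$to_decode = '{"userInfo":{"id":0},"articleInfo":{"title":"needed",'
    . '"content":"needed","groupId":"needed","itemId":"needed","type":1,'
    . '"subInfo":{"isOriginal":false,"source":"needed","time":"needed"}}}';

// true => nested associative arrays instead of stdClass objects.
$all = json_decode($to_decode, true);
$article = $all['articleInfo'];

// Keep only the keys the question marks as needed.
$base_data = array_intersect_key(
    $article,
    array_flip(['title', 'content', 'groupId', 'itemId', 'subInfo'])
);
$base_data['subInfo'] = array_intersect_key(
    $article['subInfo'],
    array_flip(['source', 'time'])
);

print_r($base_data);
```

Wrapping the result as array('articleInfo' => $base_data) gives the first of the two shapes shown in the question.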
Answer 1 (score: 0)
Thanks to @Bruno Leveque for the idea.
I edited your code as follows to make it work:
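I changed $to_decode = str_replace(' ', '', $to_decode);
to $to_decode = preg_replace('/[\n| |\s]{2,}/',' ',$to_decode);
, which means every run of two or more whitespace characters is collapsed into a single space, because sometimes the spaces are needed, for example: content: '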
I put the line $to_decode = str_replace("'", '"', $to_decode); at your comment
/* Encapsulate keys with quotes */
I changed $to_decode = preg_replace('/([a-z_]+)\:/ui', '"{$1}":', $to_decode);
to $to_decode = preg_replace('/([a-z_]+)\: /ui', '"$1":', $to_decode);
(there is one extra space in the pattern); and I commented out //$to_decode = str_replace('"{', '"', $to_decode);
and //$to_decode = str_replace('}"', '"', $to_decode);
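I also added one more line: $to_decode = str_replace(", }", '}', $to_decode);
@Bruno Leveque could not know the exact content of the "needed" and "no-needed" values, so thank you for the idea.
It seems there is no perfect way....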
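One possible way to depend less on the exact spacing is to match quoted strings and bare keys in a single pass, so that colons and spaces inside values are never touched. The following is only a sketch and comes from neither answer. jsObjectToJson is a hypothetical helper name, the sample values are made up, and it assumes a plain object literal without functions, comments, template strings, or unusual escape sequences.

```php
<?php
/**
 * Hypothetical helper: convert a simple JS object literal to JSON text.
 */
function jsObjectToJson(string $js): string
{
    // Drop a leading "var NAME =" and any trailing semicolon or whitespace.
    $js = preg_replace('/^\s*var\s+\w+\s*=\s*/', '', trim($js));
    $js = rtrim($js, "; \t\r\n");

    $pattern = '/
        "(?:[^"\\\\]|\\\\.)*"        # double-quoted string, kept as is
      | \'(?:[^\'\\\\]|\\\\.)*\'     # single-quoted string, re-quoted below
      | ([A-Za-z_$][\w$]*)(\s*:)     # bare key followed by a colon
      | ,(\s*[}\]])                  # trailing comma before a closing bracket
    /xs';

    return preg_replace_callback($pattern, static function (array $m): string {
        if (!empty($m[3])) {
            return $m[3];                      // drop the trailing comma
        }
        if (isset($m[1]) && $m[1] !== '') {
            return '"' . $m[1] . '"' . $m[2];  // quote the bare key
        }
        if ($m[0][0] === "'") {
            $inner = substr($m[0], 1, -1);
            $inner = str_replace("\\'", "'", $inner);  // \' is not a JSON escape
            $inner = str_replace('"', '\\"', $inner);  // escape double quotes
            return '"' . $inner . '"';
        }
        return $m[0];                          // double-quoted string, unchanged
    }, $js);
}

// Made-up sample with a colon, a URL and spaces inside the values.
$js = <<<'JS'
var BASE_DATA = {
    articleInfo: {
        title: 'It\'s needed: really',
        source: 'http://example.com/feed',
        tags: [{"name":"no-needed 1"}],
        repin: 0,
    },};
JS;

$data = json_decode(jsObjectToJson($js), true);
print_r($data['articleInfo']);
```

Because string literals are consumed whole before keys are considered, text such as http: or a colon inside content is left alone, and spaces inside values survive. For input that goes beyond plain literals, a real JavaScript parser or engine is still the safer choice.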