Extracting JSON from HTML with PHP

Date: 2015-06-07 13:46:57

Tags: php regex json preg-match

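I am reading the source code of an online store's website. On each product page I need to find a JSON string that lists the product's SKUs and their quantities.

Here are 2 samples: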

public class LocalizationUpdaterActivity extends Activity {

    private String[] languages = { "English", "Francais", "Espanol", "Ivrit" };
    /**
     * Called when the activity is first created.
     */
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_langues);

        SharedPreferences sp = this.getApplicationContext().getSharedPreferences("loginSaved", Context.MODE_PRIVATE);
        final SharedPreferences.Editor editor = sp.edit();


        Spinner spinner = (Spinner) findViewById(R.id.spinner1);
        spinner.setPrompt("select language");

        ArrayAdapter<String> adapter = new ArrayAdapter<String>(this,
                android.R.layout.simple_spinner_item, languages);
        adapter.setDropDownViewResource(android.R.layout.simple_spinner_dropdown_item);
        spinner.setAdapter(adapter);

        spinner.setOnItemSelectedListener(new AdapterView.OnItemSelectedListener() {

            public void onItemSelected(AdapterView arg0, View arg1,
                                       int arg2, long arg3) {
                Configuration config = new Configuration();
                switch (arg2) {
                    case 0:
                        config.locale = Locale.ENGLISH;
                        editor.putString("Langues", "en_US");
                        break;
                    case 1:
                        config.locale = Locale.FRENCH;
                        editor.putString("Langues", "fr_FR");
                        break;
                    case 2:
                        config.locale = new Locale("es_ES");
                        editor.putString("Langues", "es_ES");
                        break;
                    case 3:
                        config.locale = new Locale("he", "IL");
                        editor.putString("Langues", "he_IL");
                        break;
                    default:
                        config.locale = Locale.ENGLISH;
                        editor.putString("Langues", "en_US");
                        break;
                }
                popup("Warning !","The App will retart to apply the changes");
                getResources().updateConfiguration(config, null);
            }

            public void onNothingSelected(AdapterView arg0) {
                // TODO Auto-generated method stub

            }

        });
    }

    public void killApplication(Activity activity) {
        //Broadcast the command to kill all activities
        Intent intent = new Intent("kill");
        intent.setType("content://all");
        activity.sendBroadcast(intent);
    }

    public void restartApplication() {
        killApplication(this);

        //Start the launch activity
        Intent i = this.getBaseContext().getPackageManager().getLaunchIntentForPackage(this.getBaseContext().getPackageName());
        this.startActivity(i);
    }
    public void popup(String titre, String texte) {
        AlertDialog.Builder alertDialogBuilder = new AlertDialog.Builder(this,
                AlertDialog.THEME_HOLO_DARK);
        alertDialogBuilder.setTitle(titre).setMessage(texte)
                .setCancelable(false)
                .setNegativeButton("Ok", new DialogInterface.OnClickListener() {
                    @Override
                    public void onClick(DialogInterface dialog, int id) {
                        LocalizationUpdaterActivity.this.restartApplication();
                    }
                });
        alertDialogBuilder.create();
        alertDialogBuilder.show();
    }

    public static void CopyStream(InputStream is, OutputStream os) {
        final int buffer_size = 1024;
        try {
            byte[] bytes = new byte[buffer_size];
            for (;;) {
                int count = is.read(bytes, 0, buffer_size);
                if (count == -1)
                    break;
                os.write(bytes, 0, count);
            }
        } catch (Exception ex) {
        }
    }

}

The sample above shows 3 SKUs.

gameDisplay = pygame.display.set_mode((display_width, display_height))

The sample above shows more SKUs.

The number of SKUs in the JSON string can be anything from 1 upward.

Now I need a regex pattern that extracts this JSON string from each page. With that, I can easily work with a string such as '{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":10,"sku-SV023435_PU_M":11}'

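Update: I have found another problem here, and I apologize that my question was incomplete. The page contains a second, similar JSON string whose keys also start with sku-. Look at the source code linked below and you will see what I mean. The only difference is that the values in that one are alphanumeric, while the one we want has numeric values. Note also that our end goal is to extract each SKU's quantity, so perhaps you have a more direct solution.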

Source

@chris85

Second update:

Here is another strange problem, slightly off topic.

When I fetch the URL's content with the code below, the JSON string is not in the source!

'{"sku-11430_B_S":"20","sku-11430_B_M":"17","sku-11430_B_L":"30","sku-11430_B_XS":"13","sku-11430_BL_S":"7","sku-11430_BL_M":"17","sku-11430_BL_L":"4","sku-11430_BL_XS":"16","sku-11430_O_S":"8","sku-11430_O_M":"6","sku-11430_O_L":"22","sku-11430_O_XS":"20","sku-11430_LBL_S":"27","sku-11430_LBL_M":"25","sku-11430_LBL_L":"22","sku-11430_LBL_XS":"10","sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20","sku-11430_Y_XS":"6","sku-11430_RR_S":"4","sku-11430_RR_M":"35","sku-11430_RR_L":"47","sku-11430_RR_XS":"6"}',

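But when I open the URL in a browser, the JSON is there! I am really confused :(

3 Answers:

Answer 0: (score: 0)

You need to use preg_match_all() to perform the regular expression match (documentation here).

The following should do it for you. It matches every substring that starts with "sku-" and runs through the colon and any digits that follow it.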

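preg_match_all('/sku-.+?:[0-9]*/', $input, $matches);

Working example here

Alternatively, if you want to extract the whole string, you can use:

preg_match_all('/\{.sku-.*?\}/', $input, $matches);

This grabs everything between an opening brace and the next closing brace.

Working example here

Note that $input stands for the input string, and $matches receives the results.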
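As a rough sketch of the per-pair approach, the pattern can be extended with capture groups so that SKU names and quantities come back separately. The `$input` fragment and the surrounding `initItemAttr(...)` text below are made up for illustration, and the optional quotes around the digits are an assumption based on the second sample in the question, where quantities appear both as `36` and as `"20"`.

```php
<?php
// Hypothetical input: a fragment of page source containing the quantity JSON.
$input = "initItemAttr('{\"sku-SV023435_B_M\":7,\"sku-SV023435_BL_M\":10,\"sku-SV023435_PU_M\":11}')";

// Capture the SKU name and its numeric value. The optional quotes allow for
// quantities written either as 7 or as "7".
preg_match_all('/"(sku-[^"]+)":"?(\d+)"?/', $input, $matches);

// $matches[1] holds the SKU names, $matches[2] the quantities (as strings).
$quantities = array_combine($matches[1], array_map('intval', $matches[2]));

print_r($quantities);
```

Pairs whose value starts with a letter do not match this pattern, but a value such as "12abc" would match on its leading digits, so decoding the whole object with json_decode is likely the safer route.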

Answer 1: (score: 0)

The simple /'(\{"[^\}]+\})'/ will match all of these JSON strings. Demo: https://regex101.com/r/wD5bO4/2

The first capture group contains the JSON string, ready to pass to json_decode:

preg_match_all ("/'(\{\"[^\}]+\})'/", $html, $matches);

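$html is the HTML to parse. With the default PREG_PATTERN_ORDER flag, the captured JSON strings are in $matches[1][0], $matches[1][1], $matches[1][2], and so on, while $matches[0] holds the full matches including the surrounding single quotes.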
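A minimal sketch of how the captures could be fed to json_decode follows. The `$html` fragment and the `fn(...)` wrapper are invented for illustration; only the pattern comes from this answer.

```php
<?php
// Hypothetical page fragment with two single-quoted JSON arguments.
$html = "fn('{\"sku-A_S\":\"20\",\"sku-A_M\":36}', '{\"sku-B_S\":\"red\"}');";

preg_match_all("/'(\{\"[^\}]+\})'/", $html, $matches);

// With the default PREG_PATTERN_ORDER, $matches[1] lists every capture of group 1.
$decoded = [];
foreach ($matches[1] as $json) {
    $data = json_decode($json, true);
    if (is_array($data)) { // skip anything that is not valid JSON
        $decoded[] = $data;
    }
}

print_r($decoded);
```

Because the pattern stops at the first closing brace, this only works for flat objects like the ones in the question; nested JSON would need a different approach.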

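Answer 2: (score: 0)

Because of the way JSON is encoded, trying to pull specific values straight out of it with a regexp is almost always a bad idea. The better approach is to use a regexp to grab the whole JSON string, then decode it with the PHP function json_decode.

The missing data is caused by a missing required cookie. See my comments in the code below.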

<?php

function getHtmlFromDresslinkUrl($url)
{
    $ch = curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);

    //You must send the currency cookie to the website for it to return the json you want to scrape
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cookie: currencies_code=USD;',
    ));

    $output=curl_exec($ch);

    curl_close($ch);
    return $output;
}

$html = getHtmlFromDresslinkUrl("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");

//Get the arguments of this specific JS function call only.
//curl_exec() returns false on failure, so check for a string first.
if (is_string($html) && preg_match('/DL\.items_list\.initItemAttr\((.+)\);/', $html, $matches)) {
    $arguments = $matches[1];

    //Split by argument separator.
    //I know, this isn't great but it seems to work.
    $args_array = explode(", ", $arguments);

    //You need the 5th argument (index 4)
    if (isset($args_array[4])) {
        //Strip quotes
        $fifth_arg = trim($args_array[4], "'");

        //json_decode returns null if the string is not valid JSON
        $qty_data = json_decode($fifth_arg, true);

        //Then you can work with the php array
        if (is_array($qty_data)) {
            foreach ($qty_data as $name => $qtty) {
                echo "Found " . $qtty . " of " . $name . "<br />";
            }
        }
    }
}

?>
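The question's update mentions a second sku- object with alphanumeric values. One possible way to tell the two apart, sketched here under assumptions, is to decode every candidate string and keep the first one whose values are all numeric. The helper name `findQuantityMap` and the sample candidates are hypothetical, and the nullable return type needs PHP 7.1 or later.

```php
<?php
/**
 * Hypothetical helper: returns the first decoded object whose keys all start
 * with "sku-" and whose values are all numeric, or null if none qualifies.
 */
function findQuantityMap(array $jsonCandidates): ?array
{
    foreach ($jsonCandidates as $json) {
        $data = json_decode($json, true);
        if (!is_array($data) || $data === []) {
            continue; // not valid JSON, or an empty object
        }
        foreach ($data as $sku => $value) {
            if (strpos((string) $sku, 'sku-') !== 0 || !is_numeric($value)) {
                continue 2; // wrong shape: try the next candidate
            }
        }
        return array_map('intval', $data); // "20" and 36 both become integers
    }
    return null;
}

// Made-up candidates: the first has alphanumeric values, the second quantities.
$candidates = [
    '{"sku-11430_B_S":"red","sku-11430_B_M":"blue"}',
    '{"sku-11430_B_S":"20","sku-11430_Y_M":36}',
];
$quantities = findQuantityMap($candidates);

print_r($quantities);
```

Decoding first and checking the values afterwards also copes with quantities that the site writes inconsistently, sometimes quoted and sometimes not.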

Special thanks to @chris85 for making me read the question again. Sorry, but I can't undo my downvote.