Question

I'm using the PHP Ganon DOM parser to scrape some HTML pages, but I'm stuck at a point where I need to read some JavaScript from the page source, like this:

<script type="text/javascript">
    Event.observe(window, 'load', function() {
        ig_lightbox_main_img=0;
        ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
        ig_lightbox_img_labels.push("Some text");
        ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
        ig_lightbox_img_labels.push("Some text 2");
    });
</script>

I want to read the URLs from the script above, which is delivered with the page's HTML. Right now I'm using this code:

$html = str_get_dom('some page html here');
foreach($html('.product-img-box script[type=text/javascript]') as $script){
    echo $script->html();
}

But this doesn't work. Any ideas on how to read the script?

Answer 1

Try putting quotes around the text/javascript value of the type attribute in the selector string you pass to $html.

I looked here, and they have an example:

foreach($html('a[href ^= "http://"]') as $element) {
    $element->wrap('center');
}

I think the / may be causing it to return the wrong result.

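Edit

I was confused by the question earlier: I thought the problem was that you couldn't get at the data in the script because of your selector. Anyway, after some thought, if you have a string copy of the script tag that contains the data, just run a regular expression over it.

Here's an example that I tested: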

$string = "<script type=\"text/javascript\"> Event.observe(window, 'load', function() { ig_lightbox_main_img=0; ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg'); ig_lightbox_img_labels.push(\"Some text\"); ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg'); ig_lightbox_img_labels.push(\"Some text 2\"); }); </script>";

$regex = "/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Za-z0-9+&@#\/%=~_|$?!:,.]*[A-Za-z0-9+&@#\/%=~_|$]/";

$results = array();
preg_match_all($regex, $string, $results);
var_dump($results);

// Result:
// array(1) {
//   [0]=>
//   array(2) {
//     [0]=> string(33) "http://someimageurl.com/image.jpg"
//     [1]=> string(34) "http://someimageurl.com/image2.jpg"
//   }
// }

$results contains the URL data returned by preg_match_all (documentation).
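If getting that string copy of the script tag is the sticking point, here is a rough sketch of one way to do it. Note that this swaps in PHP's built-in DOMDocument and DOMXPath rather than Ganon, and the sample page (including the product-img-box wrapper) is a made-up stand-in for your real HTML. It reuses the same regex as above.

```php
<?php
// Hypothetical sample page; replace with the HTML you actually fetched.
$page = <<<'HTML'
<div class="product-img-box">
<script type="text/javascript">
    Event.observe(window, 'load', function() {
        ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
        ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
    });
</script>
</div>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

// Gather the text of every <script type="text/javascript"> element.
$xpath = new DOMXPath($doc);
$scriptText = '';
foreach ($xpath->query('//script[@type="text/javascript"]') as $script) {
    $scriptText .= $script->textContent . "\n";
}

// Same URL regex as in the example above, written as a single-quoted string.
$regex = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Za-z0-9+&@#\/%=~_|$?!:,.]*[A-Za-z0-9+&@#\/%=~_|$]/';
preg_match_all($regex, $scriptText, $results);
$urls = $results[0];

print_r($urls);
```

The XPath here grabs every matching script on the page; narrowing it to a particular container is left out to keep the sketch short.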

If it helps, once you have the URLs you can use parse_url (documentation) in PHP, which splits a URL string into parts that are easier to work with.
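A minimal sketch of that step, assuming the two URLs from the question have already been extracted into an array (the $urls variable name is just for illustration):

```php
<?php
// URLs as extracted from the script in the question.
$urls = [
    'http://someimageurl.com/image.jpg',
    'http://someimageurl.com/image2.jpg',
];

foreach ($urls as $url) {
    // parse_url() returns an associative array of the components it finds.
    $parts = parse_url($url);

    // basename() on the path component gives the file name.
    printf("%s | %s | %s\n", $parts['scheme'], $parts['host'], basename($parts['path']));
}
```

parse_url only includes the components that are present in the URL, so keys such as query or port will be missing for these simple addresses.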

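Note: the regular expression used here is a very simple one and won't cover every case. As discussed here and here, it's hard to come up with a perfect regex for this.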
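Since a general URL regex is hard to get right, another option is to match the specific calls in your script instead. The following is only a sketch: it assumes the script always uses ig_lightbox_img_sequence.push(...) and ig_lightbox_img_labels.push(...) with a single quoted argument, as in the question, and the variable names are mine.

```php
<?php
// Script body as shown in the question.
$script = <<<'JS'
ig_lightbox_main_img=0;
ig_lightbox_img_sequence.push('http://someimageurl.com/image.jpg');
ig_lightbox_img_labels.push("Some text");
ig_lightbox_img_sequence.push('http://someimageurl.com/image2.jpg');
ig_lightbox_img_labels.push("Some text 2");
JS;

// Capture the quoted argument of each push() call; \1 matches the same
// quote character (single or double) that opened the argument.
preg_match_all('/ig_lightbox_img_sequence\.push\(\s*([\'"])(.*?)\1\s*\)/', $script, $urlMatches);
preg_match_all('/ig_lightbox_img_labels\.push\(\s*([\'"])(.*?)\1\s*\)/', $script, $labelMatches);

// Pair each URL with its label by position. This assumes both lists have
// the same length; array_combine() fails otherwise.
$images = array_combine($urlMatches[2], $labelMatches[2]);

print_r($images);
```

This will break if the page changes how it builds those arrays, but it avoids picking up unrelated URLs elsewhere in the script.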
