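PHP parser - finding a string in HTML

Time: 2013-12-16 00:22:30

Tags: php html parsing

I want to find a string on another website. I have been looking at parsers, but I don't know which approach is best. I looked at an HTML DOM parser, but I only need a simple one-line output. I just want to put the link from "url: 'http://s2.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06'" into a variable.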

<script>
                flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf",  {
                    canvas: {
                        backgroundGradient: "none",
                        backgroundColor: "#000000"
                    },
                    clip: {
                        provider: 'lighttpd',
                        url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
                        scaling: 'fit'
                    },
                    plugins: {
                        lighttpd: {
                            url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf'
                        }
                    }
                });
            </script>

3 answers:

Answer 0 (score: 0)

Here is a handy function for grabbing the text between two delimiters:

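<?php
// Returns the text between the first $start and the next $end (case-insensitive),
// or an empty string when either delimiter is missing.
function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);
    if ($pos === false) {
        return '';
    }
    $str = substr($string, $pos + strlen($start));
    $second_pos = stripos($str, $end);
    if ($second_pos === false) {
        return '';
    }
    return trim(substr($str, 0, $second_pos)); // remove surrounding whitespace
}

// $webpageSource is assumed to hold the page's HTML, e.g. from file_get_contents().
// The first "url: '" in the script belongs to the clip, so this prints the stream URL.
echo extract_unit($webpageSource, "url: '", "'");
?>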
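As a rough, self-contained sketch of the same delimiter idea, the snippet below first narrows the search to the clip block and only then looks for the URL, so the plugin's own url entry cannot be picked up by accident. The helper name extract_between and the inline $webpageSource sample (abridged from the question's markup) are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Sketch: return the text between $start and $end, or null if either is missing.
function extract_between($haystack, $start, $end)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }
    return trim(substr($haystack, $from, $to - $from));
}

// Stand-in for the fetched page (abridged from the question's markup).
$webpageSource = <<<'HTML'
<script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf",  {
    clip: {
        provider: 'lighttpd',
        url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
        scaling: 'fit'
    },
    plugins: {
        lighttpd: {
            url: 'http://www.example.com/flowplayer.pseudostreaming-3.2.12.swf'
        }
    }
});
</script>
HTML;

// Narrow to the clip block first, then pull the quoted URL out of it.
$clip = extract_between($webpageSource, 'clip:', '}');
$streamUrl = $clip === null ? null : extract_between($clip, "url: '", "'");
echo $streamUrl, "\n";
```

Returning null instead of an empty string makes a missing delimiter distinguishable from an empty match; whether that matters depends on how the caller handles failures.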

Answer 1 (score: 0)

I would use DOMDocument.

To get the links from anchors:

$dd = new DOMDocument;
// $pageUrl is a placeholder for the address of the page that embeds the player
// (not given in the question); it is not the .mp4 URL itself.
@$dd->loadHTMLFile($pageUrl);
if($a = $dd->getElementsByTagName('a')){
  foreach($a as $t){
    $links[] = $t->getAttribute('href');
  }
}

Now $links is an array of every href; if(!isset($links)), there were no results.

To get the JSON from the script tag:

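$dd = new DOMDocument;
// $pageUrl is a placeholder for the address of the page that embeds the player
// (not given in the question); it is not the .mp4 URL itself.
@$dd->loadHTMLFile($pageUrl);
$s = $dd->getElementsByTagName('script');
if($s->length > 0){
  $c = $dd->saveHTML($s->item(0)); // serialize the first <script> element
}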

Change item(0) to the index of the relevant script tag on the page. Now $c is a string. So:

preg_match_all("/url: '.+'/", $c, $results);

$results is an array; $results[0] should contain each url: 'whatever' match. So:

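$a = array();
foreach($results[0] as $v){ // $results[0] holds the full-pattern matches
  // drop the "url: " prefix and the surrounding single quotes
  $a[] = trim(preg_replace('/url: /', '', $v), "'");
}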

$a is the resulting array.
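Put together, the DOMDocument approach might look like the sketch below. It parses an inline $html string (a stand-in assumed here for the downloaded page, with an unrelated script tag added to show why looping is safer than a fixed item(0)), walks every script element, and keeps the first clip URL it finds. The regular expression is an assumption about how the player configuration is laid out, not something DOMDocument provides.

```php
<?php
// Stand-in for the downloaded page; a real run would fetch the page that
// embeds the player (its address is not given in the question).
$html = <<<'HTML'
<html><body>
<script>var unrelated = 1;</script>
<script>
flowplayer("player", "http://www.example.com/flowplayer-3.2.16.swf", {
    clip: {
        provider: 'lighttpd',
        url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06',
        scaling: 'fit'
    }
});
</script>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$streamUrl = null;
foreach ($doc->getElementsByTagName('script') as $script) {
    // textContent is the raw JavaScript inside the <script> element.
    if (preg_match('/clip:\s*\{[^}]*?url:\s*\'([^\']+)\'/', $script->textContent, $m)) {
        $streamUrl = $m[1]; // capture group 1 is the URL without its quotes
        break;
    }
}
echo $streamUrl, "\n";
```

Using libxml_use_internal_errors() rather than the @ operator keeps real errors visible while still silencing warnings about imperfect markup.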

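Answer 2 (score: 0)

In most cases a regular expression is the best way to pull a value out of a string, although it is not recommended for processing JSON.

Here is an example (I encoded the string; it is identical to the original HTML):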

<?php 
$data = base64_decode("PHNjcmlwdD4KICAgICAgICAgICAgICAgIGZsb3dwbGF5ZXIoInBsYXllciIsICJodHRwOi8vd3d3LmV4YW1wbGUuY29tL2Zsb3dwbGF5ZXItMy4yLjE2LnN3ZiIsICB7CiAgICAgICAgICAgICAgICAgICAgY2FudmFzOiB7CiAgICAgICAgICAgICAgICAgICAgICAgIGJhY2tncm91bmRHcmFkaWVudDogIm5vbmUiLAogICAgICAgICAgICAgICAgICAgICAgICBiYWNrZ3JvdW5kQ29sb3I6ICIjMDAwMDAwIgogICAgICAgICAgICAgICAgICAgIH0sCiAgICAgICAgICAgICAgICAgICAgY2xpcDogewogICAgICAgICAgICAgICAgICAgICAgICBwcm92aWRlcjogJ2xpZ2h0dHBkJywKICAgICAgICAgICAgICAgICAgICAgICAgdXJsOiAnaHR0cDovL3MxLmV4YW1wbGUuY29tL3N0cmVhbXMvaTIzMzc0Lm1wND9rPTEyZjM0NTg4Y2YxNzFmM2JiZjNkMzVkYTRkYjQzYjA2JywKICAgICAgICAgICAgICAgICAgICAgICAgc2NhbGluZzogJ2ZpdCcKICAgICAgICAgICAgICAgICAgICB9LAogICAgICAgICAgICAgICAgICAgIHBsdWdpbnM6IHsKICAgICAgICAgICAgICAgICAgICAgICAgbGlnaHR0cGQ6IHsKICAgICAgICAgICAgICAgICAgICAgICAgICAgIHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vZmxvd3BsYXllci5wc2V1ZG9zdHJlYW1pbmctMy4yLjEyLnN3ZicKICAgICAgICAgICAgICAgICAgICAgICAgfQogICAgICAgICAgICAgICAgICAgIH0KICAgICAgICAgICAgICAgIH0pOwogICAgICAgICAgICA8L3NjcmlwdD4=");

if(preg_match('/clip:\s*\{[\s\S]+url:\s*\'(\S+)\',\s*scaling/', $data, $match) === 1)
echo $match[1];

?>

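Although the player configuration looks like JSON, PHP's json_decode cannot parse it: it is a JavaScript object literal, and json_decode accepts only strict JSON, in which property names and strings must be enclosed in double quotes.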
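The difference can be checked directly. In the sketch below, $jsLiteral mimics the clip object as it appears in the page, and $strictJson is a hand-written strict JSON equivalent; both strings are assumptions made for illustration.

```php
<?php
// The clip object as written in the page: unquoted keys, single-quoted strings.
$jsLiteral = "{provider: 'lighttpd', url: 'http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06', scaling: 'fit'}";

// The same data rewritten by hand as strict JSON: double quotes everywhere.
$strictJson = '{"provider": "lighttpd", "url": "http://s1.example.com/streams/i23374.mp4?k=12f34588cf171f3bbf3d35da4db43b06", "scaling": "fit"}';

// json_decode returns null on invalid input; json_last_error() says why.
$fromLiteral = json_decode($jsLiteral, true);
$literalError = json_last_error();

$fromJson = json_decode($strictJson, true);

var_dump($fromLiteral === null);               // the object literal is rejected
var_dump($literalError === JSON_ERROR_SYNTAX); // and reported as a syntax error
echo $fromJson['url'], "\n";                   // the strict version decodes normally
```

This is why the answers here fall back on string functions or a regular expression rather than decoding the configuration as JSON.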