Why does the page not return any XPath results?

Asked: 2018-09-28 15:01:12

Tags: php xpath

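I have run this against several web pages. XPath queries that work in 2 different XPath checker Chrome extensions return nothing when I run them from my PHP page. I wonder whether these pages have some kind of XPath blocker or something similar (yes, I am checking their robots.txt for permission). Or is there some other voodoo at work? Thanks for any help you can provide!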

Here are 2 lines from my code (edited to add more lines):

    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);

    // Grab the data.
    $html = curl_exec($c);
    curl_close($c);
    $dom = new DOMDocument();
    @$dom->loadHtml($html);
    $xpath = new DOMXPath($dom);

    $jsonScripts = $xpath->query('//script[@type="application/ld+json"]');
    if($TEST){echo "there are " . $jsonScripts->length . " JSONs<br>";}

And this is from the web page that never returns anything:

<script type="application/ld+json">{"@context":"http:\/\/schema.org\/","@type":"Recipe","name":"Healthy Garlic Scallops Recipe","author":{"@type":"Person","name":"Florentina"},"datePublished":"2015-07-29T22:39:18+00:00","description":"Italian garlic scallops, seared to a golden perfection in a cast iron pan and cooked in healthy clarified butter for the ultimate seafood meal!","image":["https:\/\/ciaoflorentina.com\/wp-content\/uploads\/2015\/07\/Garlic-Scallops-Healthy-4.jpg"],"recipeYield":"2","prepTime":"PT5M","cookTime":"PT5M","totalTime":"PT10M","recipeIngredient":["1 lb large scallops","1\/4 c clarified butter ghee","5 cloves garlic (grated)","1  large lemon (zested)","1\/4 c Italian parsley (roughly chopped)","1\/2 tsp sea salt + more to taste","1\/4 tsp peppercorn medley (freshly ground)","1\/4 tsp red pepper flakes","A pinch of sweet paprika","1 tsp extra virgin olive oil"],"recipeInstructions":[{"@type":"HowToStep","text":"Make sure to pat dry the scallops on paper towels very well before cooking."},{"@type":"HowToStep","text":"Heat up a large cast iron skillet on medium flame."},{"@type":"HowToStep","text":"Meanwhile in a medium bowl toss the scallops with a drizzle of olive oil or butter ghee, just enough to coat it all over. Sprinkle them with the sea salt, cracked pepper, red pepper flakes and sweet paprika. Toss to coat gently."},{"@type":"HowToStep","text":"Add a little drizzle of butter ghee to the hot skillet, just enough to coat the bottom. Add the scallops making sure not to overcrowd the pan, and sear for about 2 minutes on each side until nicely golden. ( Use a small spatula to flip them over individually )"},{"@type":"HowToStep","text":"Add the butter ghee to the skillet with the scallops and then add the garlic. Remove from heat and using a spatula push the garlic around to infuse the sauce for about 30 seconds. The heat from the skillet will be enough for the garlic to work its magic into the butter. 
This is how you avoid that pungent burnt garlicky taste we don\u2019t like."},{"@type":"HowToStep","text":"We are just looking to extract all that sweetness from the garlic, and this is how you do it, without burning."},{"@type":"HowToStep","text":"Squeeze half of the lemon all over the scallops and move the skillet around a little so it combines with the butter. Sprinkle with the minced parsley, lemon zest and a drizzle of extra virgin olive oil. Serve with crusty bread or al dente capellini noodles."}],"recipeCategory":["Main Dishes"],"recipeCuisine":["Italian"],"aggregateRating":{"@type":"AggregateRating","ratingValue":"5","ratingCount":"8"}}</script>

1 answer:

Answer 0 (score: 0)

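The server (Nginx) seems to be double-encoding the response (but only sometimes!). Your code is fine; if you don't get the expected result, you can try gzdecode. I put together this test script to demonstrate.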

<?php
$url = 'http://ciaoflorentina.com/garlic-scallops-recipe-healthy/';

$c = curl_init($url);
curl_setopt($c, CURLOPT_HEADER, false);
curl_setopt($c, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko');
curl_setopt($c, CURLOPT_FAILONERROR, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($c, CURLOPT_AUTOREFERER, true);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_TIMEOUT, 10);

// Grab the data.
$html = curl_exec($c);
curl_close($c);

$iterations = 0;
do
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $xpath = new DOMXpath($dom);

    $jsonScripts = $xpath->query('//script[@type="application/ld+json"]');
    $nodeCount = $jsonScripts->length;

    echo "there are " . $nodeCount . " JSONs".PHP_EOL;

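    if($nodeCount == 0)
    {
        //If garbage is coming from server, it's double encoded!
        $decoded = @gzdecode($html);
        if($decoded === false)
        {
            //Not gzip data after all, so there is nothing more to try.
            break;
        }
        $html = $decoded;
    }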

    $iterations++;
} while($nodeCount==0 && $iterations < 2);
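
If you would rather not use the XPath node count as the signal for when to decode, one alternative is to look at the body itself: gzip data starts with the two bytes 0x1f 0x8b. The following is only a minimal sketch, assuming the extra layers are plain gzip. The function name `decodeGzipLayers` and its `$maxLayers` parameter are hypothetical and not part of the script above.

```php
<?php
// Hypothetical helper (not from the original script): peel off gzip layers
// until the body no longer starts with the gzip magic bytes.
function decodeGzipLayers(string $body, int $maxLayers = 3): string
{
    for ($i = 0; $i < $maxLayers; $i++) {
        // A gzip stream begins with the two magic bytes 0x1f 0x8b.
        if (strncmp($body, "\x1f\x8b", 2) !== 0) {
            break;
        }

        $decoded = @gzdecode($body);
        if ($decoded === false) {
            // Looked like gzip but did not decode; keep what we have.
            break;
        }

        $body = $decoded;
    }

    return $body;
}
```

You would call it on the string returned by `curl_exec` before handing it to `loadHTML`, after checking that `curl_exec` did not return false. A body that is already plain HTML passes through unchanged, so it should be safe to call on every response.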