Question

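I've written a script in PHP to scrape a title, "hair fall shamboo", from a webpage. When I execute the script below, I get the following error:

Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16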

Link to that site

The script I've tried:

<?php
function get_content($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    return $htmlContent;
}
$link = "https://www.purplle.com/search?q=hair%20fall%20shamboo";
$xml = get_content($link);
$dom = @DOMDocument::loadHTML($xml);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
echo "{$title}";
?>

My expected output is:

hair fall shamboo

Although the XPath I used in the script above seems to be correct, here is the relevant portion of the HTML in which the title can be found:

<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo</span></h1>

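P.S. The title I want to parse is loaded dynamically. As I'm new to PHP, I don't know whether the approach I tried is the right one. If it isn't, what should I do instead?

Below are scripts I wrote in two other languages, and both work like a charm.

I succeeded using JavaScript: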

const puppeteer = require('puppeteer');

function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            })
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}

run().then(console.log).catch(console.error);

Likewise, I succeeded using Python:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)

So what's wrong with PHP?

Answer 1

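There are most likely more issues with your code than the ones I address in this answer, but the most glaring one I see is this:

DOMDocument::loadHTML() is not a static method but an instance method (which returns a boolean). You should create an instance of DOMDocument first and then call loadHTML() on that instance: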

$dom = new DOMDocument;
$dom->loadHTML($xml);

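However, because you suppress errors on that particular line with the @ operator, you don't get any warning. And although it is common to use the error suppression operator @ to silence HTML validation errors like this, you should consider using libxml_use_internal_errors() ¹ instead, because it does not suppress general PHP errors.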

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($xml);
libxml_use_internal_errors($oldSetting);
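If you want to see what the parser complained about instead of discarding it, the buffered errors can be read back with libxml_get_errors(). The following is only a minimal sketch; the helper name loadHtmlCollectingErrors() is made up for illustration, and the HTML string is a stand-in for your real input.

```php
<?php
// Hypothetical helper (name invented for this sketch): parse an HTML string
// and return both the document and the libxml errors raised while parsing.
function loadHtmlCollectingErrors(string $html): array
{
    $dom = new DOMDocument;

    $oldSetting = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings
    $dom->loadHTML($html);
    $errors = libxml_get_errors();                  // array of LibXMLError objects
    libxml_clear_errors();                          // empty the buffer again
    libxml_use_internal_errors($oldSetting);        // restore the previous setting

    return [$dom, $errors];
}

// Stand-in input; replace it with the HTML you actually fetched.
[$dom, $errors] = loadHtmlCollectingErrors(
    '<h1 class="br-hdng"><span>hair fall shamboo</span></h1>'
);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
echo $dom->getElementsByTagName('span')->item(0)->nodeValue, "\n";
```

Which messages end up in $errors depends on the libxml version in use, so treat them as diagnostics rather than something to rely on.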

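One last point:
If your PHP installation is configured to allow loading URLs through the allow_url_fopen configuration setting, you can load the DOM document directly from a URL with DOMDocument::loadHTMLFile() (without needing cURL). Be aware, though, that this setting is often disabled for security reasons, so use it carefully if you intend to.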
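As a rough sketch of that idea, the helper below checks allow_url_fopen before handing an http(s) location to loadHTMLFile(). The function name loadDomFromLocation() is an assumption made for this example, and the demonstration uses a temporary local file so that it runs without network access.

```php
<?php
// Hypothetical helper: load a DOM from a local path or, if PHP permits it, a URL.
function loadDomFromLocation(string $location): ?DOMDocument
{
    $isUrl = preg_match('#^https?://#i', $location) === 1;
    $urlsAllowed = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isUrl && !$urlsAllowed) {
        return null; // URL wrappers are disabled; fall back to cURL in that case
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTMLFile($location);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    return $loaded ? $dom : null;
}

// Demonstration with a temporary local file instead of a real URL.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents(
    $path,
    '<html><body><h1 class="br-hdng"><span>hair fall shamboo</span></h1></body></html>'
);
$dom = loadDomFromLocation($path);
echo $dom->getElementsByTagName('span')->item(0)->nodeValue, "\n";
unlink($path);
```

Note that this still only sees the markup the server sends; it does not help with content that is added later by JavaScript.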

Here is a simple test case that should work as expected:

<?php

$html = '
<html>
<head>
  <title>DOMDocument test-case</title>
</head>
<body>
  <div class="dummy-container">
    <h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>
  </div>
</body>';

$dom = new DOMDocument;

$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML( $html );
libxml_use_internal_errors($oldSetting);

$xpath = new DOMXPath( $dom );
$title = $xpath->query( '//h1[@class="br-hdng"]/span' )->item( 0 )->nodeValue;
echo $title;

See this example interpreted online on 3v4l.org

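You should replace the contents of $html with the output of your get_content() call. If that doesn't work, then either:

something is off with fetching the HTML using cURL (do var_dump( $html ); before loading it into DOMDocument, to see what you retrieved), or...
perhaps you are working in a namespace, in which case you should prefix DOMDocument and DOMXPath with a backslash, i.e. new \DOMDocument; and new \DOMXPath( $dom );.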
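The notice in the question is raised when item(0) returns null because the XPath query matched nothing. A defensive variant could check the result before reading nodeValue. This is only a sketch: extractTitle() is a name invented here, and the default query is the one from the question.

```php
<?php
// Hypothetical helper: return the text of the first node matching the query,
// or null when the document is empty or nothing matches.
function extractTitle(string $html, string $query = '//h1[@class="br-hdng"]/span'): ?string
{
    if ($html === '') {
        return null; // nothing was retrieved, so there is nothing to parse
    }

    $dom = new DOMDocument;
    $oldSetting = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // the element is not present in this markup
    }

    return trim($nodes->item(0)->nodeValue);
}

// Markup that contains the heading: yields the title text.
var_dump(extractTitle('<h1 class="br-hdng"><span>hair fall shamboo</span></h1>'));

// Markup without the heading: yields NULL instead of a notice.
var_dump(extractTitle('<html><body><app-root></app-root></body></html>'));
```

A null result then tells you that the fetched HTML does not contain the element, which is worth confirming with var_dump() on the raw response.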

¹ LibXML is the XML library that DOMDocument uses to parse XML/HTML documents.

Answer 2

So what's wrong with PHP?

PHP can't run JavaScript. Presumably, both puppeteer in your JavaScript code and requests_html in your Python code run JavaScript.
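To illustrate the point, here is a small sketch with made-up markup (not the real page): the heading would only exist after a browser ran the script, so DOMDocument, which parses the static markup and nothing more, finds no match.

```php
<?php
// Made-up page for illustration: the heading is created by JavaScript only.
$html = <<<'HTML'
<html><body>
<app-root></app-root>
<script>
  // In a browser this would build the heading; PHP never runs it.
  const h1 = document.createElement('h1');
  h1.className = 'br-hdng';
  const span = document.createElement('span');
  span.textContent = 'hair fall shamboo';
  h1.appendChild(span);
  document.querySelector('app-root').appendChild(h1);
</script>
</body></html>
HTML;

$dom = new DOMDocument;
$oldSetting = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//h1[@class="br-hdng"]/span');

// The script is kept as inert text, so no h1 element exists in the parsed tree.
echo 'matches: ', $nodes->length, "\n";
```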

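Your problem is that this page loads the br-hdng title and the products with JavaScript; they are not part of the HTML at all. They are actually loaded from https://www.purplle.com/api/shop/itemsv3 with a bunch of GET parameters. You need JSON parsing here, not HTML parsing :) but before you can access that API, you need a cookie provided by the search page, and the search string must match the API search string (otherwise the API just returns an error). Check this out: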

<?php
declare(strict_types = 0);
header ( "Content-Type: text/plain;charset=UTF-8" );
$ch = curl_init ();
curl_setopt_array ( $ch, array (
        CURLOPT_ENCODING => '',
        CURLOPT_COOKIEFILE => '', // enables cookie handling without saving them anywhere. this page requires cookie handling.
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0', // 'libcurl/? PHP/' . PHP_VERSION, // many websites block requests without a useragent
        CURLOPT_RETURNTRANSFER => 1 
) );
// we don't care what's on this page, we just need to fetch it to create a cookie session.
$search_query = 'hair fall shamboo';
curl_setopt ( $ch, CURLOPT_URL, 'https://www.purplle.com/search?q=' . rawurlencode ( $search_query ) );
curl_exec ( $ch );
$url = 'https://www.purplle.com/api/shop/itemsv3?' . http_build_query ( array (
        'list_type' => 'search',
        'custom' => '',
        'list_type_value' => $search_query,
        'page' => '1',
        'sort_by' => 'rel',
        'elite' => '0' 
) );
// $url = 'https://www.purplle.com/api/shop/itemsv3?list_type=search&custom=&list_type_value=hair%20fall%20shamboo&page=1&sort_by=rel&elite=0';
// $out = tmpfile ();
// curl_setopt_array ( $ch, array (
// CURLOPT_HTTPHEADER => array (
// 'Accept: application/json, text/plain, */*',
// 'Accept-Language: en-US,en;q=0.5',
// 'Referer: https://www.purplle.com/search?q=hair%20fall%20shamboo',
// // Cookie: __cfduid=d3199415b5ce18cbff2779802b1f843331544901552; csrftoken=f8f18b5deae92972f63343e13c6a460b; purpllesession=hedxkc%2FkdGye%2BYi6ebmJktUN1LeqA5rdVXu96%2F0j0yqtP2xZ8LfwpK8daXqPSkeZulO9ZvqpMYXTmY8oMD03VcG9vdKGBm30R9fU%2FQygtXBFhZvfvsu0scyaL3FqHbePp2zG45MevWU961eg82KAkCuHk0qFM8URQBRyYV5gg8TeqnTPgI3tF87H5nJ%2BmfO4pn%2BRWmIuWXvgNXAO%2F8GEaH6lJVl17QZm9c5vwi10OYeLfmSdIMy6V2Pp0ZjLTzuFw2de5jpR0zsbHHKZ0C2e548PiDl3taHIE5wuZO4HYIeXUqTpE98%2Fo3kztoU1bTlXGZgu%2FxVQ3EWLRFWQ2t57UawA%2FuERlD8vvOyFGbYHGAWVxgFTR%2FObAhFLHns5kqoj; _autm30d=null; visitorppl=NZ5tqQpGlFYWg2MrDl1302113161544901552; session_initiated=Direct; _tmpsess=1; token=desktop_5c1553b07c61c_7955_16122018; __uzma=5c1553b085a480.63440826; __uzmb=1544901552; __uzmc=632121030774; __uzmd=1544901552
// 'Connection: keep-alive'
// ),
// // CURLOPT_CONNECT_TO=>array('www.purplle.com:443:dumpinput.ratma.net:80'),
// CURLOPT_STDERR => $out,
// CURLOPT_VERBOSE => 1
// ) );
// var_dump ( $url );
curl_setopt ( $ch, CURLOPT_URL, $url );
$json = curl_exec ( $ch );
$data = json_decode ( $json, true );

// var_dump ($json, $data );
$title = $data ['list_title'];
echo 'title: ' . $title . "\n";
foreach ( $data ['items'] as $item ) {
    echo "name: ", $item ['name'], "\n";
}

Output:

title: hair fall shamboo
name: VLCC Hair fall Shampoo 350 ML (Buy1 Get1) & Ayurveda Hair Oil Combo (470 ml)
name: Biotique Bio Kelp Protein Shampoo For Falling Hair (190 ml)
name: Biotique Fresh Texture Shampoo - Bio Henna Leaf (120 ml)
name: Good Vibes Scalp Purifying Shampoo -Neem And Aloe Vera (200 ml)
name: Khadi Shikakai Sat Hair Cleanser Scalp Therapy (210 ml) By Swati Gramodyog
name: Good Vibes Apple Cider Vinegar Shampoo (120 ml)
name: Good Vibes Refreshing Shampoo - Green Apple (200 ml)
name: Good Vibes Hydrating Shampoo -Marigold (200 ml)
name: Alps Goodness Smoothening Shampoo - Keratin (50 ml)
name: Alps Goodness Softening Shampoo - Coconut & Almond (50 ml)
name: Alps Goodness Split End Control Shampoo - Coconut, Garlic & Shea Butter (50 ml)
name: Passion Indulge Papain Shampoo & Conditioner For Soft & Shiny Hair (200 ml + 100 ml)
name: Good Vibes Apple Cider Vinegar Shampoo (200 ml)
name: Alps Goodness Split End Control Shampoo - Coconut, Garlic & Shea Butter (200 ml)
name: Alps Goodness Nourishing Shampoo - Argan Oil & Olive (200 ml)
name: Alps Goodness Moisturizing Shampoo - Ginger & Egg (200 ml)
name: Alps Goodness Conditioning Shampoo - Pure Honey (200 ml)
name: Alps Goodness Hydrating Shampoo - Tea Tree (200 ml)
name: Alps Goodness Smoothening Shampoo - Keratin (200 ml)
name: Alps Goodness Softening Shampoo - Coconut & Almond (200 ml)
name: Good Vibes Scalp Purifying Shampoo -Neem And Aloe Vera (120 ml)
name: Good Vibes Hydrating Shampoo - Marigold (120 ml)
name: Alps Goodness Conditioning Shampoo - Pure Honey (50 ml)
name: Alps Goodness Moisturizing Shampoo - Ginger & Egg (50 ml)
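Since the API is said to return an error when the cookie or search string is wrong, it may be worth validating the response before indexing into it. The sketch below assumes only the keys used above (list_title, items, name); the function name summarizeSearchResponse() and the sample payload are invented for illustration and are not real API output.

```php
<?php
// Hypothetical helper: decode the API response and pull out the title and item names.
function summarizeSearchResponse(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response is not valid JSON: ' . json_last_error_msg());
    }

    $items = $data['items'] ?? [];
    $names = [];
    foreach (is_array($items) ? $items : [] as $item) {
        if (isset($item['name'])) {
            $names[] = $item['name'];
        }
    }

    return [
        'title' => $data['list_title'] ?? null, // null when the key is missing
        'names' => $names,
    ];
}

// Made-up sample payload with the same shape as the keys read above.
$sample = '{"list_title":"hair fall shamboo","items":[{"name":"Example Shampoo A"},{"name":"Example Shampoo B"}]}';

$summary = summarizeSearchResponse($sample);
echo 'title: ', $summary['title'], "\n";
foreach ($summary['names'] as $name) {
    echo 'name: ', $name, "\n";
}
```

In the script above you would pass the $json returned by curl_exec() after checking that it is not false.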

Unable to get some title from a webpage
