Question

I want to scrape data from http://www.ifanca.org/Pages/Certified-Products.aspx?search=22535. Here is my PHP script:

<?php
 //get the html returned from the following url
$html = file_get_contents(
  'http://www.ifanca.org/Pages/Certified-Products.aspx?search=22535');

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)) { //if any html is actually returned

  $pokemon_doc->loadHTML($html);
  libxml_clear_errors(); //remove errors for yucky html

  $pokemon_xpath = new DOMXPath($pokemon_doc);

  $pokemon_row = $pokemon_xpath->query('//*[@id="example"]');

  if($pokemon_row->length > 0){
    foreach($pokemon_row as $row){
      echo $row->nodeValue . "<br/>";
    }
  }
}
?>

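It gives me this result:

Product NameCompany NameSold InMarketing TypeProduct TypeProduct CodeLogoIfanca Code

That is fine. But when I try to get the product names, for example "4Life Transfer Factor Belle Vie", with the query //*[@id="example"]/tbody/tr/td[1], it returns nothing.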

I need help getting the product name data.

Answer 1

If you wget the file and inspect its contents, you will see that everything is done in JavaScript, and the initial HTML of the table is:

<table id="example" class="display"  
       width="100%" cellpadding="0" cellspacing="0" border="0">

  <thead>
    <tr><th width="23%" style="width:23% !important">Product Name</th>
        <th width="22%" style="width:22% !important">Company Name </th>
        <th width="13%" style="width:13% !important">Sold In</th>
        <th width="10%" style="width:10% !important">Marketing Type</th>
        <th width="10%" style="width:10% !important">Product Type</th>
        <th width="10%" style="width:10% !important">Product Code</th>
        <th width="5%" style="width:5% !important" >Logo</th>
        <th width="7%" style="width:7% !important">Ifanca Code</th>
  </thead>
  <tbody>
  </tbody>
</table>

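Neither file_get_contents nor DOMDocument parses or executes JavaScript for you. That is why you get an empty result set for

//*[@id="example"]/tbody/tr/td[1]

The rows simply do not exist in the document you downloaded.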
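
To see this without hitting the site, here is a minimal sketch that runs both kinds of query against an abridged copy of the static table markup. The abridged HTML string and the variable names are illustrative assumptions, not the site's exact response.

```php
<?php
// Static markup as served before any JavaScript runs
// (abridged, hypothetical copy of the table shown above).
$html = <<<'HTML'
<table id="example" class="display">
  <thead>
    <tr><th>Product Name</th><th>Company Name</th><th>Sold In</th></tr>
  </thead>
  <tbody>
  </tbody>
</table>
HTML;

libxml_use_internal_errors(true); // keep libxml warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The header cells are part of the static HTML, so this query matches.
$headers = $xpath->query('//*[@id="example"]/thead/tr/th');

// The body rows are added later by JavaScript, so this query matches nothing.
$cells = $xpath->query('//*[@id="example"]/tbody/tr/td[1]');

$headerCount = $headers->length;
$cellCount = $cells->length;

echo "header cells: $headerCount\n";
echo "first-column data cells: $cellCount\n";
```

The header query finds the th elements, while the tbody query returns an empty node list, which matches what the question describes.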

Answer 2

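This site relies on JavaScript. If you open the network developer tools (in Firefox and most other browsers) while the page loads, you will see that it makes four AJAX POST requests to the server. Most likely each one depends on another, so scraping them may not be easy.

Usually I would suggest scraping the AJAX GET requests, since there is (and should be) only one per data source, but this site fetches its content in a way that wastes HTTP resources and is hard to scrape. In fact, that may be why the developers did it this way: they do not want others republishing their information.

One of the requests takes this XML as its input: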

<?xml version="1.0" encoding="UTF-8"?>
<Request xmlns="http://schemas.microsoft.com/sharepoint/clientquery/2009" SchemaVersion="15.0.0.0" LibraryVersion="15.0.0.0" ApplicationName="Javascript Library">
   <Actions>
      <ObjectPath Id="1" ObjectPathId="0" />
      <ObjectPath Id="3" ObjectPathId="2" />
      <ObjectPath Id="5" ObjectPathId="4" />
      <ObjectPath Id="7" ObjectPathId="6" />
      <ObjectIdentityQuery Id="8" ObjectPathId="6" />
      <ObjectPath Id="10" ObjectPathId="9" />
      <ObjectPath Id="12" ObjectPathId="11" />
      <ObjectIdentityQuery Id="13" ObjectPathId="11" />
      <ObjectPath Id="15" ObjectPathId="14" />
      <Query Id="16" ObjectPathId="9">
         <Query SelectAllProperties="true">
            <Properties />
         </Query>
         <ChildItemQuery SelectAllProperties="true">
            <Properties />
         </ChildItemQuery>
      </Query>
   </Actions>
   <ObjectPaths>
      <StaticProperty Id="0" TypeId="{3747adcd-a3c3-41b9-bfab-4a64dd2f1e0a}" Name="Current" />
      <Property Id="2" ParentId="0" Name="Web" />
      <Property Id="4" ParentId="2" Name="Lists" />
      <Method Id="6" ParentId="4" Name="GetByTitle">
         <Parameters>
            <Parameter Type="String">HCM</Parameter>
         </Parameters>
      </Method>
      <Method Id="9" ParentId="6" Name="GetItems">
         <Parameters>
            <Parameter TypeId="{3d248d7b-fc86-40a3-aa97-02a75d69fb8a}">
               <Property Name="DatesInUtc" Type="Boolean">true</Property>
               <Property Name="FolderServerRelativeUrl" Type="Null" />
               <Property Name="ListItemCollectionPosition" Type="Null" />
               <Property Name="ViewXml" Type="String">&lt;View Scope="RecursiveAll"&gt;&lt;Query&gt;&lt;Where&gt;&lt;And&gt;&lt;IsNotNull&gt;&lt;FieldRef Name="Year"/&gt;&lt;/IsNotNull&gt;&lt;In&gt;&lt;FieldRef Name="FileType"/&gt;&lt;Values&gt;&lt;Value Type="Choice"&gt;Image&lt;/Value&gt;&lt;Value Type="Choice"&gt;Flipbook&lt;/Value&gt;&lt;Value Type="Choice"&gt;pdf&lt;/Value&gt;&lt;/Values&gt;&lt;/In&gt;&lt;/And&gt;&lt;/Where&gt;&lt;OrderBy&gt;&lt;FieldRef Name="IssueNo" Ascending="False" /&gt;&lt;/OrderBy&gt;&lt;/Query&gt;&lt;RowLimit&gt;10&lt;/RowLimit&gt;&lt;/View&gt;</Property>
            </Parameter>
         </Parameters>
      </Method>
      <Method Id="11" ParentId="4" Name="GetByTitle">
         <Parameters>
            <Parameter Type="String">HDNL</Parameter>
         </Parameters>
      </Method>
      <Method Id="14" ParentId="11" Name="GetItems">
         <Parameters>
            <Parameter TypeId="{3d248d7b-fc86-40a3-aa97-02a75d69fb8a}">
               <Property Name="DatesInUtc" Type="Boolean">true</Property>
               <Property Name="FolderServerRelativeUrl" Type="Null" />
               <Property Name="ListItemCollectionPosition" Type="Null" />
               <Property Name="ViewXml" Type="String">&lt;View Scope="RecursiveAll"&gt;&lt;Query&gt;&lt;Where&gt;&lt;IsNotNull&gt;&lt;FieldRef Name="YYYY"/&gt;&lt;/IsNotNull&gt;&lt;/Where&gt;&lt;OrderBy&gt;&lt;FieldRef Name="IssueNumber" Ascending="False" /&gt;&lt;/OrderBy&gt;&lt;/Query&gt;&lt;RowLimit&gt;3&lt;/RowLimit&gt;&lt;/View&gt;</Property>
            </Parameter>
         </Parameters>
      </Method>
   </ObjectPaths>
</Request>

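Yuck! If you want to build requests by sending similar documents, you will have to work out the format. I suspect it would be easier to use a headless browser here, such as PhantomJS. There are PHP drivers for it, such as Spiderling. That will run the JavaScript for you (on a modern WebKit browser), and you will be able to retrieve data with XPath or CSS selectors.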
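
As a rough sketch of the headless-browser approach: this does not use PhantomJS or Spiderling, but swaps in headless Chromium's --dump-dom flag, called from PHP. The binary name "chromium" and both function names are assumptions for illustration. The dump may also be taken before the page's AJAX calls finish, in which case the table would still be empty.

```php
<?php
/**
 * Render a page in headless Chromium and return the resulting DOM as HTML.
 * Assumption: a Chromium/Chrome binary is installed; "chromium" is a
 * placeholder name that may differ on your system.
 */
function renderWithHeadlessBrowser(string $url, string $binary = 'chromium'): ?string
{
    $command = escapeshellcmd($binary)
        . ' --headless --disable-gpu --dump-dom '
        . escapeshellarg($url);
    $output = shell_exec($command);

    // shell_exec returns null or false on failure or empty output.
    return is_string($output) && $output !== '' ? $output : null;
}

/**
 * Return the trimmed text of the first cell of every body row
 * of the table with id "example".
 */
function extractFirstColumn(string $html): array
{
    if ($html === '') {
        return [];
    }

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query('//*[@id="example"]/tbody/tr/td[1]') as $cell) {
        $names[] = trim($cell->nodeValue);
    }
    return $names;
}

// Usage (not executed here, since it needs a browser and network access):
// $html = renderWithHeadlessBrowser(
//     'http://www.ifanca.org/Pages/Certified-Products.aspx?search=22535');
// if ($html !== null) {
//     print_r(extractFirstColumn($html));
// }
```

Keeping the rendering step separate from the XPath step means the extraction can be checked against saved HTML, and the renderer can be replaced by another driver without touching the parsing code.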

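(Bear in mind that data on other websites may be copyrighted. You may go to the trouble of setting up a scraper only to find that you are the target of an IP block or, worse, legal action. The rights and wrongs of scraping are rather complicated, but my brief advice is that if you can scrape from a range of targets, your project will be less liable to fail.)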

Answer 3

I solved my problem using the DIFFBOT Article API; the API is available at https://www.diffbot.com.

Scraping data from a JavaScript-driven website