Converting table rows to an array after cURL

Time: 2015-10-07 15:59:43

Tags: php arrays dom curl xpath

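I am looking for a way to store HTML table rows in an array with PHP, with each column value as a separate array element.

To start with, I have a complete HTML page that I fetched with curl. This page contains a table with a specific ID (example_table).

How can I select this table and then put each cell value into a two-dimensional array?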

<table id="example_table">
    <tr><td>A1</td><td>B1</td><td>C1</td><td>D1</td></tr>
    <tr><td>A2</td><td>B2</td><td>C2</td><td>D2</td></tr>
    <tr><td>A3</td><td>B3</td><td>C3</td><td>D3</td></tr>
</table>

The resulting array would look like this:

array_example[2][3] = D3

//Edit:

The HTML I get from curl looks like this:

<table style="width: 95%; border-collapse: collapse" id="itemDetails"> 
       <tbody>
        <tr> 
         <td class="photo" style="width: 150px; text-align: center; padding: 16px 0 10px 0; vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%2Fdp%2FB003629R5S%2Fref%3Dpe_386181_40444391_TE_item_image&amp;A=UOK26PXWANT3G9FAME6Z7XWZJVWA&amp;H=6B71WXRFQA1P9GFWS8UJRWK0VRAA&amp;ref_=pe_386181_40444391_TE_item_image" title="B003629R5S" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif"> <img id="asin" src="http://ecx.images-amazon.com/images/I/31FSVzCchgL._SCLZZZZZZZ__SY115_SX115_.jpg" style="border: 0"> </a> </td> 
         <td class="name" style="color: rgb(102, 102, 102); padding: 10px 0 0 0; vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%2Fdp%2FB003629R5S%2Fref%3Dpe_386181_40444391_TE_item&amp;A=GNBXWEPQKFU3GEGJBGMMWYKA3K4A&amp;H=RXNWUWDFVKS3LQE1FENOQS4VDXCA&amp;ref_=pe_386181_40444391_TE_item" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif"> Brabantia Lot de 12 rouleaux de 10 sacs poubelle Type L 45 l </a> <br> Etat : Neuf <br> Vendu par <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%2Fgp%2Fhelp%2Fseller%2Fhome.html%2Fref%3Dpe_386181_40444391_TE_seller%3Fie%3DUTF8%26seller%3DA2ANA7NET4TQ0F&amp;A=AJJRA9DQK9EDVNDQDNAULH4KOC4A&amp;H=XH19ITMSWA3KJ0PSBTHLNQAFYAAA&amp;ref_=pe_386181_40444391_TE_seller" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif">Perfect Groceries</a> <br> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%2Fexpedieparamazon%3Fref_%3Dpe_386181_40444391_TE_helpfba&amp;A=KEYAA7VCZNWVKEA7P2LYC49LKQMA&amp;H=W03OAAPQITJM5WD6MC5LG21OLVIA&amp;ref_=pe_386181_40444391_TE_helpfba" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif">Expédié par Amazon</a> <br> 
          <div style="vertical-align: top; align=center;"> 
           <table border="0" cellspacing="4" cellpadding="0" style="border-collapse: separate"> 
            <tbody style="vertical-align: bottom;"> 
             <tr> 
              <td style="vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> </td> 
              <td style="vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%3A80%2Fgp%2Fredirect.html%2Fref%3Dpe_386181_40444391_cm_sw_cl_fa_doce%2F280-1861239-2544346%3F_encoding%3DUTF8%26location%3Dhttp%253A%252F%252Fwww.facebook.com%252Fdialog%252Ffeed%253Fapp_id%253D164734381262%2526caption%253D%2526display%253Dpopup%2526link%253Dhttp%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_fa_doce%2526name%253D%2526picture%253Dhttp%25253A%25252F%25252Fecx.images-amazon.com%25252Fimages%25252FI%25252F31FSVzCchgL._SCLZZZZZZZ__SY115_SX115_.jpg%2526redirect_uri%253Dhttp%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_fa_doce%26source%3Dstandards%26token%3D6BD0FB927CC51E76FF446584B1040F70EA7E88E1&amp;A=O66YJALVI4AECB8UEEBF4NGUHQQA&amp;H=PAUAVYQX28VPMP9DQELUI7PJWJWA&amp;ref_=pe_386181_40444391_cm_sw_cl_fa_doce" title="Facebook" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif"> <img src="http://g-ecx.images-amazon.com/images/G/08/x-locale/personalization/live-meter/facebook._V15055984_.gif" width="16" alt="Facebook" style="vertical-align: middle; border: 0" height="16" border="0"> </a> </td> 
              <td style="vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%3A80%2Fgp%2Fredirect.html%2Fref%3Dpe_386181_40444391_cm_sw_cl_tw_doce%2F280-1861239-2544346%3F_encoding%3DUTF8%26location%3Dhttp%253A%252F%252Ftwitter.com%252Fshare%253Fcount%253Dnone%2526original_referer%253Dhttp%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_tw_doce%2526related%253Damazon%25252Camazondeals%25252Camazonmp3%2526text%253DBrabantia%252520Lot%252520de%25252012%252520rouleaux%252520de%25252010%252520sacs%252520poubelle%252520Type%252520L%25252045%252520l%252520sur%252520Amazon%2526twitterURL%253Dhttp%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_tw_doce%2526via%253Damazon%26source%3Dstandards%26token%3D7A1A4AE8F6CE0BD277D8295E58702D283F329C0F&amp;A=KPDO6A0PIPKRQL84ARGCMAOOCASA&amp;H=TA6BYC0F3HFJPCCQIIOCPYIGFAGA&amp;ref_=pe_386181_40444391_cm_sw_cl_tw_doce" title="Twitter" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif"> <img src="http://g-ecx.images-amazon.com/images/G/08/x-locale/communities/social/twitter._V388040480_.gif" width="16" alt="Twitter" style="vertical-align: middle; border: 0" height="16" border="0"> </a> </td> 
              <td style="vertical-align: top; font-size: 12px; line-height: 18px; font-family: Arial, sans-serif"> <a href="https://www.amazon.fr/gp/r.html?C=11II60L0IUDTQ&amp;K=A37E83YVOBN2AM&amp;R=JC53DV4YW1VB&amp;T=C&amp;U=http%3A%2F%2Fwww.amazon.fr%3A80%2Fgp%2Fredirect.html%2Fref%3Dpe_386181_40444391_cm_sw_cl_pi_doce%2F280-1861239-2544346%3F_encoding%3DUTF8%26location%3Dhttp%253A%252F%252Fpinterest.com%252Fpin%252Fcreate%252Fbutton%252F%253Fdescription%253DBrabantia%252520Lot%252520de%25252012%252520rouleaux%252520de%25252010%252520sacs%252520poubelle%252520Type%252520L%25252045%252520l%252520sur%252520Amazon%25252C%252520http%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_pi_doce%2526is_video%253Dfalse%2526media%253Dhttp%25253A%25252F%25252Fecx.images-amazon.com%25252Fimages%25252FI%25252F31FSVzCchgL._SCLZZZZZZZ__SY115_SX115_.jpg%2526title%253D%2526url%253Dhttp%25253A%25252F%25252Fwww.amazon.fr%25252Fdp%25252FB003629R5S%25252Fref%25253Dcm_sw_r_pi_doce%26source%3Dstandards%26token%3D9F58B366258E1A8B5259E9BEF3482E02341F42D3&amp;A=RDONF9RAZWJSW6DTDZM6CAUCAXAA&amp;H=GEAUNFZ4QS9J5KE00AWBWWLX81UA&amp;ref_=pe_386181_40444391_cm_sw_cl_pi_doce" title="Pinterest" style="text-decoration: none; color: rgb(0, 102, 153); font: 12px/ 16px Arial, sans-serif"> <img src="http://g-ecx.images-amazon.com/images/G/08/x-locale/communities/social/pinterest._V389372180_.png" width="16" alt="Pinterest" style="vertical-align: middle; border: 0" height="16" border="0"> </a> </td> 
             </tr> 
            </tbody> 
           </table> 
          </div> </td> 
         <td class="price" style="width: 80px; text-align: right; font-size: 14px; padding: 10px 10px 0 0; vertical-align: top; line-height: 18px; font-family: Arial, sans-serif"> <strong>EUR 59,99</strong> <br> </td> 
        </tr> 
       </tbody>
      </table>

1 Answer:

Answer 0: (score: 1)

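The table data cells in your example have no text content apart from some whitespace. They do have child elements with attributes, and I assume that is the data you want to extract.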

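Use DOM + XPath. DOM can load the HTML (it will fix errors and may change the structure while doing so). DOMXPath::evaluate() lets you fetch both node lists and scalar values from a DOM. XPath expressions are used to address nodes inside the DOM.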
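As a minimal sketch of the two kinds of result that evaluate() can return, consider the snippet below. The sample markup and the variable names ($links, $href, $count) are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical sample markup, used only to illustrate the return types.
$html = '<html><body><p id="intro"><a href="/docs">Docs</a></p></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// A plain location path yields a DOMNodeList that can be iterated.
$links = $xpath->evaluate('//p[@id="intro"]/a');

// Wrapping the expression in an XPath function such as string() or count()
// yields a scalar instead (a PHP string or a float).
$href  = $xpath->evaluate('string(//p[@id="intro"]/a/@href)');
$count = $xpath->evaluate('count(//a)');

var_dump($links instanceof DOMNodeList);
var_dump($href);
var_dump($count);
```

The second argument of evaluate(), used in the answer's code that follows, is a context node: relative expressions such as td[a] are then resolved against that node rather than the document root.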

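// $html holds the page source returned by curl
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$result = [];
// rows of the nested table inside the table with id "itemDetails"
foreach ($xpath->evaluate('//table[@id="itemDetails"]//table/tbody/tr') as $tr) {
  $row = [];
  // only cells that contain a link; the expression is relative to the current row
  foreach ($xpath->evaluate('td[a]', $tr) as $td) {
    // string() casts the first matching node to a string (empty if there is none)
    $row[] = [
      'href' => $xpath->evaluate('string(a/@href)', $td),
      'image' => $xpath->evaluate('string(a/img/@src)', $td),
      'text' => $xpath->evaluate('string(a/img/@alt)', $td)
    ];
  }
  $result[] = $row;
}

var_dump($result);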

Output:

array(1) {
  [0]=>
  array(3) {
    [0]=>
    array(3) {
      ["href"]=>
      string(908) "https://www...."
      ["image"]=>
      string(103) "http://g-ecx..."
      ["text"]=>
      string(8) "Facebook"
    }
    [1]=>...
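
For the simpler example_table from the first part of the question, the same DOM + XPath approach can be sketched as follows. This is only an illustration under a few assumptions: $html stands in for the string returned by curl_exec(), the table is not nested inside other tables, and each cell's text content is the value wanted.

```php
<?php
// Stand-in for the page source returned by curl_exec(); only the table matters here.
$html = <<<'HTML'
<html><body>
<table id="example_table">
    <tr><td>A1</td><td>B1</td><td>C1</td><td>D1</td></tr>
    <tr><td>A2</td><td>B2</td><td>C2</td><td>D2</td></tr>
    <tr><td>A3</td><td>B3</td><td>C3</td><td>D3</td></tr>
</table>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid HTML.
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$array_example = [];
// The descendant axis (//tr) matches whether or not a tbody sits between table and tr.
foreach ($xpath->evaluate('//table[@id="example_table"]//tr') as $tr) {
    $row = [];
    // "td" is evaluated relative to the current row.
    foreach ($xpath->evaluate('td', $tr) as $td) {
        $row[] = trim($td->textContent);
    }
    $array_example[] = $row;
}

echo $array_example[2][3], "\n"; // prints "D3"
```

Note that //tr would also pick up rows of any table nested inside example_table, so a page like the itemDetails one needs a more specific expression, as in the answer above.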