Get external content by class

Date: 2017-05-22 11:57:19

Tags: php

I can successfully fetch external content from a table by its class with the following code.

$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<table class="main">' , $content );
$second_step = explode("</table>" , $first_step[1] );

echo $second_step[0];

Now I need the content of <a class="link">content</a>, but this:

$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<a class="link">' , $content );
$second_step = explode("</a>" , $first_step[1] ); 

does not work.

Meanwhile, I am using this code:

    // Create DOM from URL or file

    $sFilex = file_get_html("https://www.anything.com", False, $cxContext);

    // Find all links
    foreach($sFilex->find('a[class=link]') as $element)
    echo $element->href . '<br>';

This successfully gets all <a class="link">content</a> links. But how can I limit it to only the first result found?

The actual code of the link is:
<a class="link" id="55834" href="/this/is/a/test">this is a test</a>

Thanks for your help!

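1 answer:

Answer 0: (score: 1)

Since I suggested using a proper HTML parser, which can be a bit daunting for the inexperienced, I thought I could give you an example to start with: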

$url = 'https://www.anything.com';

// create a new DOMDocument (an XML/HTML parser)
$doc = new DOMDocument;
// this is used to repair possibly malformed HTML
$doc->recover = true;

// libxml is the parse library that DOMDocument internally uses
// put errors in a memory buffer, instead of outputting them immediately (basically ignore them, until you need them, if ever)
// remember the previous setting, so that it can be restored at the end
$useInternalErrors = libxml_use_internal_errors( true );

// load the external URL; this might not work if retrieving external files is disabled.
// I will come back to that if it doesn't work for you.
$doc->loadHTMLFile( $url );

// xpath is a query language that allows you to query XML/HTML data structures.
// we create a DOMXPath instance that operates on the earlier created DOMDocument
$xpath = new DOMXPath( $doc );

// this is a query to get all <table class="main">
// note though, that it will also match <table class="test maintain">, etc.
// which might not be what you need
$tableMainQuery = '//table[contains(@class,"main")]';
/* explanation:
   //         match any descendant of the current context, in this case root
   table      match <table> elements
   []         with the predicate(s)
   contains() match a string, that contains some string, in this case:
   @class     the attribute 'class'
   'main'     containing the string main
*/   

// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $tableMainQuery );

// loop through all nodes
foreach( $nodes as $node ) {
  // echo the inner HTML content of the found node (or do something else with it)
  // (the getInnerHTML() helper function is defined below)
  // remove htmlentities() to output the actual HTML
  echo htmlentities( getInnerHTML( $node ) );
}

// this is a query to get all <a class="link">
// similar comments and explanation apply as with previous query
$aLinkQuery = '//a[contains(@class,"link")]';

// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $aLinkQuery );

// loop through all nodes
foreach( $nodes as $node ) {
  // do something with the found nodes again
}

// clear any errors still left in memory
libxml_clear_errors();
// set previous state
libxml_use_internal_errors( $useInternalErrors );

// the helper function to get the inner HTML of a found node
function getInnerHTML( DOMNode $node ) {
  $html = '';
  foreach( $node->childNodes as $childNode ) {
    $html .= $childNode->ownerDocument->saveHTML( $childNode );
  }

  return $html;
}
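
As noted in the comments above, contains(@class,"main") also matches class="test maintain". If that is a problem, one common workaround is to pad the class attribute with spaces and search for the class name as a whole word. Below is a minimal sketch of that idea; the sample HTML string is made up for illustration and stands in for the downloaded page:

```php
<?php
// hypothetical sample markup, standing in for the downloaded page
$html = '<table class="main"><tr><td>wanted</td></tr></table>'
      . '<table class="test maintain"><tr><td>unwanted</td></tr></table>';

// buffer parse errors and remember the previous setting
$useInternalErrors = libxml_use_internal_errors( true );

$doc = new DOMDocument;
$doc->loadHTML( $html );

// clear buffered errors and restore the previous setting
libxml_clear_errors();
libxml_use_internal_errors( $useInternalErrors );

$xpath = new DOMXPath( $doc );

// the loose query from above: matches both tables
$looseQuery = '//table[contains(@class,"main")]';

// pad the normalized class attribute with spaces,
// then look for " main " as a whole, space-delimited token
$exactQuery = '//table[contains(concat(" ", normalize-space(@class), " "), " main ")]';

$looseCount = $xpath->query( $looseQuery )->length;
$exactCount = $xpath->query( $exactQuery )->length;

// prints "2 1"
echo $looseCount, ' ', $exactCount, PHP_EOL;
```

The same pattern should work for the link query by replacing table with a and " main " with " link ".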

Now, to get only the first found node of the xpath query result (a DOMNodeList instance), I think the easiest way is:

// in both the examples below $node will contain the element you are looking for
// $nodes will keep being a list of all found nodes

if( $nodes->length > 0 ) {
  $node = $nodes->item( 0 );
  // do something with the $node
}

// or, perhaps
if( null !== ( $node = $nodes->item( 0 ) ) ) {
  // do something with the $node
}
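
To make "do something with the $node" more concrete, here is a small sketch that reads the href attribute and the link text of the first match. The first link in the sample HTML is the one from the question; the second link is invented so that there is more than one match:

```php
<?php
// first link taken from the question, second link is a made-up example
$html = '<a class="link" id="55834" href="/this/is/a/test">this is a test</a>'
      . '<a class="link" id="55835" href="/another/link">another link</a>';

// buffer parse errors and remember the previous setting
$useInternalErrors = libxml_use_internal_errors( true );

$doc = new DOMDocument;
$doc->loadHTML( $html );

// clear buffered errors and restore the previous setting
libxml_clear_errors();
libxml_use_internal_errors( $useInternalErrors );

$xpath = new DOMXPath( $doc );
$nodes = $xpath->query( '//a[contains(@class,"link")]' );

$href = null;
$text = null;

// item( 0 ) returns null when nothing was found
$node = $nodes->item( 0 );
if( $node instanceof DOMElement ) {
  // getAttribute() returns an empty string if the attribute is missing
  $href = $node->getAttribute( 'href' );
  // textContent is the text inside the element, without any tags
  $text = $node->textContent;
}

// prints "/this/is/a/test | this is a test"
echo $href, ' | ', $text, PHP_EOL;
```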

You can also adjust the xpath query to find only the first matching node, but I believe it will still return a DOMNodeList.
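
A sketch of what such an adjusted query could look like, again using made-up sample HTML in place of the real page. Note the parentheses: without them, [1] would apply per parent element rather than to the whole result:

```php
<?php
// first link taken from the question, second link is a made-up example
$html = '<a class="link" id="55834" href="/this/is/a/test">this is a test</a>'
      . '<a class="link" id="55835" href="/another/link">another link</a>';

// buffer parse errors and remember the previous setting
$useInternalErrors = libxml_use_internal_errors( true );

$doc = new DOMDocument;
$doc->loadHTML( $html );

// clear buffered errors and restore the previous setting
libxml_clear_errors();
libxml_use_internal_errors( $useInternalErrors );

$xpath = new DOMXPath( $doc );

// ( ... )[1] selects the first node of the complete result, in document order
$firstLinkQuery = '(//a[contains(@class,"link")])[1]';
$nodes = $xpath->query( $firstLinkQuery );

// the result is still a DOMNodeList, it just holds at most one node
$count = $nodes->length;
$firstHref = $count > 0 ? $nodes->item( 0 )->getAttribute( 'href' ) : null;

// prints "1 /this/is/a/test"
echo $count, ' ', $firstHref, PHP_EOL;
```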