I successfully use the following code to retrieve external content from a table with a given class.
$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<table class="main">' , $content );
$second_step = explode("</table>" , $first_step[1] );
echo $second_step[0];
Now I need the content of <a class="link">content</a>, but
$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<a class="link">' , $content );
$second_step = explode("</a>" , $first_step[1] );
does not work.
I also use this code:
// Create DOM from URL or file
$sFilex = file_get_html("https://www.anything.com", False, $cxContext);
// Find all links
foreach($sFilex->find('a[class=link]') as $element)
echo $element->href . '<br>';
This successfully gets all the <a class="link">content</a> links. But how can I limit this to only the first result found?
The actual markup is:
<a class="link" id="55834" href="/this/is/a/test">this is a test</a>
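Thanks for your help!
Answer 0 (score: 1)
Since I suggested using a proper HTML parser, which can be a bit intimidating for someone inexperienced, I thought I would give you an example to start with: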
$url = 'https://www.anything.com';
// create a new DOMDocument (an XML/HTML parser)
$doc = new DOMDocument;
// this is used to repair possibly malformed HTML
$doc->recover = true;
// libxml is the parse library that DOMDocument internally uses
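// put errors in a memory buffer, instead of outputting them immediately (basically ignore them, until you need them, if ever)
// the call returns the previous setting, which we keep so it can be restored at the end
$useInternalErrors = libxml_use_internal_errors( true );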
// load the external URL; this might not work if retrieving external files is disabled.
// I will come back to that if it doesn't work for you.
$doc->loadHTMLFile( $url );
// xpath is a query language that allows you to query XML/HTML data structures.
// we create a DOMXPath instance that operates on the earlier created DOMDocument
$xpath = new DOMXPath( $doc );
// this is a query to get all <table class="main">
// note though, that it will also match <table class="test maintain">, etc.
// which might not be what you need
$tableMainQuery = '//table[contains(@class,"main")]';
/* explanation:
   //          match any descendant of the current context, in this case root
   table       match <table> elements
   []          with the predicate(s)
   contains()  match a string that contains some string, in this case:
   @class      the attribute 'class'
   'main'      containing the string main
*/
// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $tableMainQuery );
// loop through all nodes
foreach( $nodes as $node ) {
// echo the inner HTML content of the found node (or do something else with it)
// (the getInnerHTML() helper function is defined below)
// htmlentities() escapes the markup for display; remove it to output the actual HTML
echo htmlentities( getInnerHTML( $node ) );
}
// this is a query to get all <a class="link">
// similar comments and explanation apply as with previous query
$aLinkQuery = '//a[contains(@class,"link")]';
// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $aLinkQuery );
// loop through all nodes
foreach( $nodes as $node ) {
// do something with the found nodes again
}
// clear any errors still left in memory
libxml_clear_errors();
// restore the previous libxml error-handling state
libxml_use_internal_errors( $useInternalErrors );
// the helper function to get the inner HTML of a found node
function getInnerHTML( DOMNode $node ) {
$html = '';
foreach( $node->childNodes as $childNode ) {
$html .= $childNode->ownerDocument->saveHTML( $childNode );
}
return $html;
}
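The comments in the code above warn that contains(@class,"main") also matches classes such as "test maintain". A common XPath 1.0 idiom for matching one whole class name is to pad the normalized class attribute with spaces and search for the padded name. The sketch below is only an illustration: the inline $html string is made-up sample markup standing in for the fetched page, and the variable names are my own.

```php
<?php
// Hypothetical sample markup, standing in for the page fetched from the URL.
$html = '<table class="main"><tr><td>wanted</td></tr></table>'
      . '<table class="test maintain"><tr><td>unwanted</td></tr></table>';

$doc = new DOMDocument;
// Buffer parse warnings instead of printing them, and remember the old setting.
$previous = libxml_use_internal_errors( true );
$doc->loadHTML( $html );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$xpath = new DOMXPath( $doc );

// Substring match: also hits "maintain", because it contains "main".
$looseQuery = '//table[contains(@class,"main")]';

// Whole-token match: " main " must appear in the space-padded class list.
$exactQuery = '//table[contains(concat(" ", normalize-space(@class), " "), " main ")]';

$looseCount = $xpath->query( $looseQuery )->length;
$exactCount = $xpath->query( $exactQuery )->length;

echo "loose: $looseCount, exact: $exactCount\n";
```

With this sample markup the substring query finds both tables, while the padded query finds only the one whose class list contains the token main.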
Now, to get only the first node found by an XPath query (a DOMNodeList instance), I think the easiest way is:
// in both the examples below $node will contain the element you are looking for
// $nodes will keep being a list of all found nodes
if( $nodes->length > 0 ) {
$node = $nodes->item( 0 );
// do something with the $node
}
// or, perhaps
if( null !== ( $node = $nodes->item( 0 ) ) ) {
// do something with the $node
}
You can also adjust the XPath query to find only the first matching node, but I believe it will still return a DOMNodeList.
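As a rough sketch of that last idea: wrapping the whole expression in parentheses and appending [1] selects the first match in document order. Without the parentheses, [1] would apply per parent element and could still return several nodes. The inline $html string is assumed sample markup (the first link is taken from the question, the second is invented), and the result is indeed still a DOMNodeList, so item( 0 ) is needed.

```php
<?php
// Sample markup: the first link comes from the question, the second is hypothetical.
$html = '<p><a class="link" id="55834" href="/this/is/a/test">this is a test</a></p>'
      . '<p><a class="link" href="/another/page">another link</a></p>';

$doc = new DOMDocument;
// Buffer parse warnings instead of printing them, and remember the old setting.
$previous = libxml_use_internal_errors( true );
$doc->loadHTML( $html );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$xpath = new DOMXPath( $doc );

// The parentheses make [1] apply to the whole result set, not per parent.
$firstLinkQuery = '(//a[contains(@class,"link")])[1]';
$nodes = $xpath->query( $firstLinkQuery );

$href = null;
$text = null;
$first = $nodes->item( 0 ); // null when nothing matched
if ( $first instanceof DOMElement ) {
    $href = $first->getAttribute( 'href' );
    $text = $first->textContent;
    echo $href . ' => ' . $text . "\n";
}
```

For the sample markup this prints /this/is/a/test => this is a test, and $nodes->length is 1 even though two links carry the class.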