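I am using Simple HTML DOM Parser to query Google for a specific keyword and then loop over the results. However, I don't want to pick up ads or the news box. Ads are easy to exclude because their list elements have a different class, but the newsbox li element has the same class as a regular result, plus an additional ID.
Result li element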
<li class="g">...</li>
Newsbox li element
<li class="g" id="newsbox">...</li>
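How can I exclude the li element with the ID newsbox?
I read around here and, following someone else's suggestion, this is the closest I got, but it does not work: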
$query = file_get_html('https://google.com/search?q=test');
$li_elements = $query->find('li[class=g id!=newsbox]');
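Does anyone have other ideas, or has someone solved this before?
I am still working on this and have nearly run out of ideas. Here is my latest code: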
include('simple_html_dom.php');
$html = file_get_html('https://www.google.co.uk/search?q=football');
// Collect results here so print_r() always receives an array
$articles = array();
// Find all article blocks
foreach ($html->find('#res h3.r') as $article) {
    $item['title'] = $article->plaintext;
    $item['intro'] = $article->find('a', 0)->href;
    $articles[] = $item;
}
print_r($articles);
This is the printed array:
Array
(
[0] => Array
(
[title] => BBC Sport - Football
[intro] => /url?q=http://www.bbc.co.uk/sport/0/football/&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CBQQFjAA&usg=AFQjCNGHTFqXJoRjHKBSCdKFiW_BX6eGDw
)
[1] => Array
(
[title] => News for football
[intro] => /search?q=football&ie=UTF-8&prmd=ivnsl&source=univ&tbm=nws&tbo=u&sa=X&ei=NkblU-s8h6nQBcCJgOAI&ved=0CB8QqAI
)
[2] => Array
(
[title] => Football Games, Results, Scores, Transfers, News | Sky Sports
[intro] => /url?q=http://www1.skysports.com/football/&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CCgQFjAE&usg=AFQjCNE4VP4WAHIYJAoPIBJoUx1pC-1jBA
)
[3] => Array
(
[title] => Local business results for football near London NW5
[intro] => https://maps.google.co.uk/maps?um=1&ie=UTF-8&fb=1&gl=uk&q=football&hq=football&hnear=0x48761a535791ef6f:0x493f677c231558c8,London+NW5&sa=X&ei=NkblU-s8h6nQBcCJgOAI&ved=0CC4QtQM
)
[4] => Array
(
[title] => Football news, match reports and fixtures | Football | The Guardian
[intro] => /url?q=http://www.theguardian.com/football&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CE4QFjAM&usg=AFQjCNHPhgIljb53cFPRHlb1vCa1fmWJag
)
[5] => Array
(
[title] => NewsNow: Football News | Breaking News & Search 24/7
[intro] => /url?q=http://www.newsnow.co.uk/h/Sport/Football&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CFQQFjAN&usg=AFQjCNEmmlrEayvHdebKTfPykGhHxRioLA
)
[6] => Array
(
[title] => Football365 - Football News, Views, Gossip and much more...
[intro] => /url?q=http://www.football365.com/&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CFoQFjAO&usg=AFQjCNFKIP3xgtxw9DhNtOhVfpT4pbpLPw
)
[7] => Array
(
[title] => Football - Wikipedia, the free encyclopedia
[intro] => /url?q=http://en.wikipedia.org/wiki/Football&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CGAQFjAP&usg=AFQjCNF2Fk8WH4rzEvWzmYIEUycZnjvjpg
)
[8] => Array
(
[title] => Football in London - Things To Do - visitlondon.com
[intro] => /url?q=http://www.visitlondon.com/things-to-do/whats-on/sport/football&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CGYQFjAQ&usg=AFQjCNEdSNJc-mlVpaWEY9yPjcoDSaDLIw
)
[9] => Array
(
[title] => London Football Leagues - 5-a-side - 7-a-side - 11-a-side - Midweek ...
[intro] => /url?q=http://www.londonfootball.co.uk/&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CHMQFjAR&usg=AFQjCNGnZtZQxUmUYQtDF0Tj5nJRnR2Yig
)
[10] => Array
(
[title] => Football Tickets and Event Details | Ticketmaster UK Sport
[intro] => /url?q=http://www.ticketmaster.co.uk/browse/football-catid-11/sport-rid-10004&sa=U&ei=NkblU-s8h6nQBcCJgOAI&ved=0CHkQFjAS&usg=AFQjCNFwTfpq-klboIEf0EbhlMQWvzHeKQ
)
)
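I don't understand why the second result, array[1][title], is stored in the array, because according to this line, $html->find('#res h3.r') as $article, it should not be there. It is neither inside the div with id #res nor inside an h3 tag.
Any ideas?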
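Judging only from the array printed above, the regular results all link through /url?q=..., while the news and local-business blocks link to /search?... or to maps.google.co.uk. A minimal sketch of a post-filter based on that observation follows. filter_organic_results is a hypothetical helper, not part of Simple HTML DOM, and the link pattern is an assumption drawn from this one output that may not hold for other queries.

```php
<?php
// Hypothetical helper: keep only entries whose link uses the "/url?" redirect
// form seen in the printed array, and extract the target URL from "q".
function filter_organic_results(array $articles): array
{
    $prefix = '/url?';
    $organic = [];
    foreach ($articles as $item) {
        // Skip news/map blocks, which do not start with the redirect prefix.
        if (strncmp($item['intro'], $prefix, strlen($prefix)) !== 0) {
            continue;
        }
        // Decode the query string that follows the prefix.
        parse_str(substr($item['intro'], strlen($prefix)), $params);
        if (empty($params['q'])) {
            continue;
        }
        $item['url'] = $params['q'];
        $organic[] = $item;
    }
    return $organic;
}

// Sample entries shortened from the printed array.
$sample = [
    ['title' => 'BBC Sport - Football', 'intro' => '/url?q=http://www.bbc.co.uk/sport/0/football/&sa=U'],
    ['title' => 'News for football', 'intro' => '/search?q=football&ie=UTF-8&tbm=nws'],
];
$organic = filter_organic_results($sample);
print_r($organic);
```

This filters after the fact rather than fixing the selector, so it is a workaround and not an explanation of why the news heading matched.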
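Answer 0 (score: 0)
Unfortunately, Simple HTML DOM Parser does not support that kind of flexibility, but there is a workaround.
You can first remove the unwanted block and then retrieve the ones you want: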
$query->find('li#newsbox', 0)->outertext = '';
$li_elements = $query->find('li.g');
Here is sample code that shows how it works:
$input = <<<_DATA_
<div class="g" id="newsbox">Bad node</div>
<div class="g">Useful node</div>
_DATA_;
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($input);
// Remove the bad node
$html->find('div#newsbox', 0)->outertext = ''; // Comment out this line to print the original HTML content
echo $html;
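For comparison, the same remove-then-retrieve idea can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This swaps in a different library (DOMDocument and DOMXPath), and the sample markup is the one from the snippet above, not real Google output.

```php
<?php
// Same two sample nodes as in the example above.
$input = <<<HTML
<div class="g" id="newsbox">Bad node</div>
<div class="g">Useful node</div>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Step 1: detach every element carrying the unwanted id.
// iterator_to_array() takes a snapshot so removal cannot disturb iteration.
foreach (iterator_to_array($xpath->query("//*[@id='newsbox']")) as $bad) {
    $bad->parentNode->removeChild($bad);
}

// Step 2: collect the remaining blocks whose class attribute is exactly "g".
$remaining = [];
foreach ($xpath->query("//div[@class='g']") as $div) {
    $remaining[] = trim($div->textContent);
}
print_r($remaining);
```

Here the node is actually detached from the tree, so the second query cannot see it.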
Answer 1 (score: 0)
simple_html_dom claims to support it, so this looks like a bug.
The proper CSS way to select this, li.g:not(#newsbox), is not supported by simple_html_dom, but it is supported by this one.
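As an illustration of what li.g:not(#newsbox) is meant to match, here is a rough XPath equivalent using PHP's built-in DOMXPath. This is a sketch under assumptions: it is not the library linked above, and the sample list is invented to mirror the markup in the question.

```php
<?php
// Invented sample modelled on the question; real Google markup may differ.
$input = <<<HTML
<ul>
  <li class="g">Result one</li>
  <li class="g" id="newsbox">News box</li>
  <li class="g extra">Result two</li>
</ul>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// <li> elements whose class list contains "g" and whose id is not "newsbox".
$nodes = $xpath->query(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' g ') and not(@id='newsbox')]"
);

$texts = [];
foreach ($nodes as $li) {
    $texts[] = trim($li->textContent);
}
print_r($texts);
```

The concat/normalize-space idiom matches "g" as a whole class token, so an element with class "g extra" is kept while one with class "gx" would not be.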