Question

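Hello, I am trying to scrape the UFC event schedule using Simple HTML DOM Parser.

I am struggling to select the right data.

I want the title, image, date, time and location.

So far I have tried: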

function scraping_ufc() {
    // create HTML DOM
    $html = file_get_html('http://uk.ufc.com/schedule/event/');

    // get news block
    foreach($html->find('table tr') as $event) {
        // get title
        $item['title'] = trim($event->find('div[class="event-tagline"]', 0)->innertext);
        // get details
        $item['date'] = trim($event->find('div[class="date"]', 0)->innertext);

        $item['time'] = trim($event->find('div[class="time"]', 0)->innertext);

        $ret[] = $item;
    }


    // clean up memory
    $html->clear();
    unset($html);

    return $ret;
}

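I am selecting a lot of table rows that I don't need. I did manage to get the title, but not the date or time.

Please help me select the data I need efficiently.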

Answer 1

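First, stop using Simple HTML DOM; it is less reliable than PHP's built-in DOM library. It was useful a few years ago, but these days it tends to cause more problems than it solves.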

$dom = new DOMDocument();
@$dom->loadHTMLFile('http://uk.ufc.com/schedule/event/');
$xpath = new DOMXPath($dom);
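
The `@` above hides the warnings libxml can emit for sloppy real-world HTML. A slightly more explicit alternative is `libxml_use_internal_errors()`, sketched below on an inline string so it runs without network access. The markup and the helper name `load_html_quietly` are made up for illustration; they are not part of the UFC page or of any library.

```php
<?php
// Hypothetical, deliberately sloppy markup standing in for a fetched page.
$markup = '<html><body><div class="date">Sat, Jan 1<p>unclosed</div></body></html>';

// Hypothetical helper: parse an HTML string without emitting PHP warnings.
function load_html_quietly($markup) {
    // Collect libxml parse errors internally instead of raising warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    libxml_clear_errors();                  // discard whatever was collected
    libxml_use_internal_errors($previous);  // restore the earlier setting
    return $dom;
}

$dom = load_html_quietly($markup);
$xpath = new DOMXPath($dom);
$date = $xpath->query('//div[@class="date"]')->item(0);
```

If you want to inspect the parse problems rather than discard them, `libxml_get_errors()` returns them before they are cleared.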

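Next, you need a better way to identify the rows you want. table tr selects every tr on the page, which you don't want. It would be nice if the trs had classes, but they don't, so I came up with this: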

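$ret = array();
// Select every <tr> that is the parent of a td with class "upcoming-events-image"
foreach($xpath->query('//td[@class="upcoming-events-image"]/..') as $tr){
  $item = array();
  // The leading "." makes each query relative to the current row
  $item['title'] = trim($xpath->query('.//div[@class="event-tagline"]/a', $tr)->item(0)->nodeValue);
  $item['date'] = trim($xpath->query('.//div[@class="date"]', $tr)->item(0)->nodeValue);
  $item['time'] = trim($xpath->query('.//div[@class="time"]', $tr)->item(0)->nodeValue);
  $ret[] = $item;
}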
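
To see the parent-row selection in isolation, here is a minimal sketch that runs the same XPath expressions against a small inline table. The sample markup is invented to mimic the structure those expressions assume, and `first_text` is a hypothetical helper that returns null instead of failing when a row lacks one of the divs.

```php
<?php
// Made-up markup that mimics the structure the XPath expressions assume.
$markup = <<<HTML
<table>
  <tr><td>Unrelated row</td></tr>
  <tr>
    <td class="upcoming-events-image"><img src="poster.jpg"></td>
    <td>
      <div class="event-tagline"><a href="#">Fighter A vs Fighter B</a></div>
      <div class="date">Saturday, Jan 1</div>
      <div class="time">8PM</div>
    </td>
  </tr>
</table>
HTML;

// Hypothetical helper: trimmed text of the first match under $context,
// or null when nothing matches (avoids reading ->nodeValue on null).
function first_text(DOMXPath $xpath, $query, DOMNode $context) {
    $node = $xpath->query($query, $context)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

$events = array();
// "/.." steps from the marker cell up to its parent row, so only rows
// that contain an event image are visited.
foreach ($xpath->query('//td[@class="upcoming-events-image"]/..') as $tr) {
    $events[] = array(
        'title' => first_text($xpath, './/div[@class="event-tagline"]/a', $tr),
        'date'  => first_text($xpath, './/div[@class="date"]', $tr),
        'time'  => first_text($xpath, './/div[@class="time"]', $tr),
    );
}

print_r($events);
```

The unrelated first row is skipped because it has no `upcoming-events-image` cell, which is the point of anchoring on that cell rather than on `table tr`.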
