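How do I parse data fetched with cURL to extract a DL?

Date: 2015-03-31 20:20:24

Tags: php curl

I want to display a list of journals together with their abbreviations, like this:

Journal name, abbreviation

I am getting the data from this URL: http://images.webofknowledge.com/WOK46/help/WOS/D_abrvjt.html so I am running the following: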

$ch = curl_init();

//Set options
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html'
));
$result = curl_exec($curl);
curl_close($curl);

$data = json_decode($result, true);
//!End function, make_call

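But right now it shows me the entire page, and as I said, I only need the journal names (dt) and abbreviations (dd). So how do I parse the result?

1 Answer:

Answer 0 (score: 1)

Parse the HTML DOM with Simple HTML DOM. A scraper approach...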

<?php

function Scraper($file, $cnt = null) {
    /*
      @param $file  URL or path to a file
      @param $cnt   number of results to list per DL; leave empty for all
    */
    require_once('PATH/TO/simple_html_dom.php');
    //set_time_limit(0); // uncomment for large files
    $result = array();

    // Create DOM from URL
    $html = file_get_html($file);
    if ($html) {
        foreach ($html->find('DL') as $dl) {
            // Count the DTs inside this DL rather than in the whole document
            $limit = empty($cnt) ? count($dl->find('DT')) : $cnt;

            for ($i = 0; $i < $limit; $i++) {
                $dt = $dl->find('DT', $i);
                $dd = $dl->find('DD', $i);
                if (!$dt || !$dd) { break; } // this DL has fewer entries than requested
                $result[] = array(trim($dt->plaintext) => trim($dd->plaintext));
            }
        }
    }

    return $result;
}

$array = Scraper('http://somesite.com/page.html');
print_r($array);
?>

Sample output...

Array
(
    [0] => Array
        (
            [D H LAWRENCE REVIEW] => D H LAWRENCE REV
        )

    [1] => Array
        (
            [D-D EXCITATIONS IN TRANSITION-METAL OXIDES] => SPRINGER TR MOD PHYS
        )

    [2] => Array
        (
            [DADOS-REVISTA DE CIENCIAS SOCIAIS] => DADOS-REV CIENC SOC
        )

    [3] => Array
        (
            [DAEDALUS] => DAEDALUS
        )

    [4] => Array
        (
            [DAEDALUS] => DAEDALUS-US
        )

    [5] => Array
        (
            [DAGHESTAN AND THE WORLD OF ISLAM] => SUOMAL TIED TOIM SAR
        )

)
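
Since the question asks for lines of the form "journal name, abbreviation", the nested array above can be flattened with a short loop. This is only a sketch: formatJournalList is a hypothetical helper name that is not part of the original answer, and $sample is hard-coded from the output above rather than fetched.

```php
<?php
// Hypothetical helper (not part of the original answer): turns the
// [ [name => abbreviation], ... ] structure into "name, abbreviation" lines.
function formatJournalList(array $pairs) {
    $lines = array();
    foreach ($pairs as $pair) {
        foreach ($pair as $name => $abbr) {
            $lines[] = $name . ', ' . $abbr;
        }
    }
    return $lines;
}

// Sample data copied from the output above.
$sample = array(
    array('D H LAWRENCE REVIEW' => 'D H LAWRENCE REV'),
    array('DAEDALUS' => 'DAEDALUS'),
    array('DAEDALUS' => 'DAEDALUS-US'),
);

echo implode("\n", formatJournalList($sample)), "\n";
```

Keeping one small array per entry, instead of a single name-to-abbreviation map, means a repeated name such as DAEDALUS does not overwrite an earlier entry.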

Updated example for user350082's issue...

The DT and DD tags of the definition list are not closed, which causes the dd text to be included in the find('dt') results.

<DT>D H LAWRENCE REVIEW<B><DD>  D H LAWRENCE REV</B>
<DT>D-D EXCITATIONS IN TRANSITION-METAL OXIDES<B><DD>   SPRINGER TR MOD PHYS</B>
etc. etc. etc.
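
The duplication can be reproduced without the library. In the sketch below, $inner is an assumption about what the DT's inner markup looks like once the DD ends up nested inside it (copied from the first line above); the string Simple HTML DOM actually returns may differ in whitespace.

```php
<?php
// Assumed inner markup of the first DT, with the unclosed DD nested inside it.
$inner = 'D H LAWRENCE REVIEW<B><DD>  D H LAWRENCE REV</B>';

// Stripping the tags alone leaves both texts glued together.
$naive = preg_replace('/\s+/', ' ', strip_tags($inner));
echo $naive, "\n"; // D H LAWRENCE REVIEW D H LAWRENCE REV

// One way to separate them: remove the LAST occurrence of the DD text.
$dd = 'D H LAWRENCE REV';
$dt = preg_replace('/\s+/', ' ', $inner); // collapse runs of whitespace
$pos = strrpos($dt, $dd);
if ($pos !== false) {
    $dt = substr_replace($dt, '', $pos, strlen($dd));
}
$dt = trim(strip_tags($dt));
echo $dt, "\n"; // D H LAWRENCE REVIEW
```

Searching from the end with strrpos matters here: the abbreviation is often a prefix of the full name, so strpos would match at position 0 and cut the wrong part.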

Updated function...

function Scraper($file, $cnt = null) {

    /*
      @param $file  URL or path to a file
      @param $cnt   number of results to list per DL; leave empty for all
    */
    require_once('PATH/TO/simple_html_dom.php');
    //set_time_limit(0); // uncomment for large files
    $result = array();

    // Create DOM from URL
    $html = file_get_html($file);
    if ($html) {

        foreach ($html->find('DL') as $dl) {

            // default to every DT inside this DL when no count is given
            $limit = empty($cnt) ? count($dl->find('DT')) : $cnt;
            for ($i = 0; $i < $limit; $i++) {

                $dtNode = $dl->find('DT', $i);
                $ddNode = $dl->find('DD', $i);
                if (!$dtNode || !$ddNode) { break; } // this DL has fewer entries than requested

                // normalize whitespace in both strings so the DD text can be matched inside the DT
                $dd = trim(preg_replace('/\s+/', ' ', $ddNode->plaintext));
                $dt = preg_replace('/\s+/', ' ', $dtNode->innertext); // dt with html tags, easier for removing dd duplication

                // strip DD text duplication from DT
                if ($dd !== '' && ($pos = strrpos($dt, $dd)) !== false) {
                    $dt = substr_replace($dt, '', $pos, strlen($dd));
                }

                $dt = trim(strip_tags($dt)); // remove html tags
                if ($dt === '') { $dt = $dd; } // make sure dt is not empty

                $result[] = array($dt => $dd);

            }

        }

    }

    return $result;

}
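
On the cURL side of the question: by default curl_exec() prints the response and returns true, which is why the whole page shows up, and json_decode() returns null when it is handed HTML instead of JSON. Below is a minimal sketch of a download helper, assuming the cURL extension is available; fetchPage is a hypothetical name and the timeout value is an arbitrary choice.

```php
<?php
// Hypothetical helper: download a page and return its body as a string,
// or null on failure.
function fetchPage($url) {
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // seconds; arbitrary
    ));
    $body = curl_exec($curl); // string on success, false on failure
    // The handle goes out of scope, and is released, when the function returns.

    return ($body === false) ? null : $body;
}

// Usage (performs a network request, so it is left commented out):
// $page = fetchPage('http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html');
// if ($page !== null) {
//     $html = str_get_html($page); // Simple HTML DOM's string parser
// }
```

The returned string can be handed to Simple HTML DOM's str_get_html() in place of calling file_get_html() on the URL, so the rest of Scraper stays the same.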