如何从.html页面中提取链接和标题?

时间:2010-12-12 18:35:59

标签: php html string hyperlink web-crawler

对于我的网站,我想添加一项新功能。

我希望用户能够上传他的书签备份文件(如果可能的话,从任何浏览器上传),这样我就可以将其上传到他们的个人资料中,而且他们不必手动插入所有这些...

我唯一缺少这样做的部分是从上传的文件中提取标题和URL的部分..任何人都可以提供线索从哪里开始或在哪里阅读?

使用了搜索选项和(how to extract data from a raw html file)这是我最相关的问题,它没有谈论它..

我真的不介意它是否使用jquery或php

非常感谢

7 个答案:

答案 0 :(得分:53)

谢谢大家,我知道了!

最终代码: 这会显示已分配的锚点文本以及.html文件中所有链接的 href

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link){
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

再次,非常感谢。

答案 1 :(得分:33)

这可能就足够了:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}

答案 2 :(得分:5)

假设存储的链接在html文件中,最好的解决方案可能是使用html解析器,例如PHP Simple HTML DOM Parser(我自己从未尝试过)。 (另一种选择是使用基本字符串搜索或regexp进行搜索,你可能从不使用regexp来解析html)。

使用解析器读取html文件后,使用它的函数来查找a标记:

来自教程:

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 

答案 3 :(得分:3)

这是一个例子,您可以在这种情况下使用:

$content = file_get_contents('bookmarks.html');

运行:

<?php

$content = '<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>';

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));

输出:

Array
(
    [0] => http://clicklink.com
    [1] => http://foobar.com
)
  

http://clicklink.com

     

http://foobar.com

答案 4 :(得分:1)

$html = file_get_contents('your file path');

$dom = new DOMDocument;

@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');

$links = $dom->getElementsByTagName('a');

$scripts = $dom->getElementsByTagName('script');

foreach($styles as $style)
{

    if($style->getAttribute('href')!="#")

    {
        echo $style->getAttribute('href');
        echo'<br>';
    }
}

foreach ($links as $link){

    if($link->getAttribute('href')!="#")
    {
        echo $link->getAttribute('href');
        echo'<br>';
    }
}

foreach($scripts as $script)
{

        echo $script->getAttribute('src');
        echo'<br>';

}

答案 5 :(得分:0)

我想从html页面创建CSV链接路径及其文本,以便可以从站点中翻录菜单等。

在此示例中,您指定了您感兴趣的域,这样您就不会离开站点链接,然后为每个文档生成CSV

/**
 * Extracts links to the given domain from the files and creates CSVs of the links
 */


$LinkExtractor = new LinkExtractor('https://www.example.co.uk');

$LinkExtractor->extract(__DIR__ . '/hamburger.htm');
$LinkExtractor->extract(__DIR__ . '/navbar.htm');
$LinkExtractor->extract(__DIR__ . '/footer.htm');

class LinkExtractor {
    public $domain;

    public function __construct($domain) {
      $this->domain = $domain;
    }

    public function extract($file) {
        $html = file_get_contents($file);
        //Create a new DOM document
        $dom = new DOMDocument;

        //Parse the HTML. The @ is used to suppress any parsing errors
        //that will be thrown if the $html string isn't valid XHTML.
        @$dom->loadHTML($html);

        //Get all links. You could also use any other tag name here,
        //like 'img' or 'table', to extract other tags.
        $links = $dom->getElementsByTagName('a');

        $results = [];
        //Iterate over the extracted links and display their URLs
        foreach ($links as $link){
            //Extract and sput the matching links in an array for the CSV
            $href = $link->getAttribute('href');
            $parts = parse_url($href);
            if (!empty($parts['path']) && strpos($this->domain, $parts['host']) !== false) {
                $results[$parts['path']] = [$parts['path'], $link->nodeValue];
            }
        }

        asort($results);
        // Make the CSV
        $fp = fopen($file .'.csv', 'w');
        foreach ($results as $fields) {
            fputcsv($fp, $fields);
        }
        fclose($fp);
    }
}

答案 6 :(得分:-1)

这是我为一位客户做的工作,它可以作为一种功能在任何地方使用。

function getValidUrlsFrompage($source)
  {
    $links = [];
    $content = file_get_contents($source);
    $content = strip_tags($content, "<a>");
    $subString = preg_split("/<\/a>/", $content);
    foreach ($subString as $val) {
      if (strpos($val, "<a href=") !== FALSE) {
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        $val = preg_replace("/\".*/", "", $val);
        $val = trim($val);
      }
      if (strlen($val) > 0 && filter_var($val, FILTER_VALIDATE_URL)) {
        if (!in_array($val, $links)) {
          $links[] = $val;
        }
      }
    }
    return $links;
  }

并像使用它

$links = getValidUrlsFrompage("https://www.w3resource.com/");

预期的输出是在一个数组中获取99个URL,

Array ( [0] => https://www.w3resource.com [1] => https://www.w3resource.com/html/HTML-tutorials.php [2] => https://www.w3resource.com/css/CSS-tutorials.php [3] => https://www.w3resource.com/javascript/javascript.php [4] => https://www.w3resource.com/html5/introduction.php [5] => https://www.w3resource.com/schema.org/introduction.php [6] => https://www.w3resource.com/phpjs/use-php-functions-in-javascript.php [7] => https://www.w3resource.com/twitter-bootstrap/tutorial.php [8] => https://www.w3resource.com/responsive-web-design/overview.php [9] => https://www.w3resource.com/zurb-foundation3/introduction.php [10] => https://www.w3resource.com/pure/ [11] => https://www.w3resource.com/html5-canvas/ [12] => https://www.w3resource.com/course/javascript-course.html [13] => https://www.w3resource.com/icon/ [14] => https://www.w3resource.com/linux-system-administration/installation.php [15] => https://www.w3resource.com/linux-system-administration/linux-commands-introduction.php [16] => https://www.w3resource.com/php/php-home.php [17] => https://www.w3resource.com/python/python-tutorial.php [18] => https://www.w3resource.com/java-tutorial/ [19] => https://www.w3resource.com/node.js/node.js-tutorials.php [20] => https://www.w3resource.com/ruby/ [21] => https://www.w3resource.com/c-programming/programming-in-c.php [22] => https://www.w3resource.com/sql/tutorials.php [23] => https://www.w3resource.com/mysql/mysql-tutorials.php [24] => https://w3resource.com/PostgreSQL/tutorial.php [25] => https://www.w3resource.com/sqlite/ [26] => https://www.w3resource.com/mongodb/nosql.php [27] => https://www.w3resource.com/API/google-plus/tutorial.php [28] => https://www.w3resource.com/API/youtube/tutorial.php [29] => https://www.w3resource.com/API/google-maps/index.php [30] => https://www.w3resource.com/API/flickr/tutorial.php [31] => https://www.w3resource.com/API/last.fm/tutorial.php [32] => https://www.w3resource.com/API/twitter-rest-api/ [33] => https://www.w3resource.com/xml/xml.php [34] => https://www.w3resource.com/JSON/introduction.php [35] => https://www.w3resource.com/ajax/introduction.php [36] => https://www.w3resource.com/html-css-exercise/index.php [37] => https://www.w3resource.com/javascript-exercises/ [38] => https://www.w3resource.com/jquery-exercises/ [39] => https://www.w3resource.com/jquery-ui-exercises/ [40] => https://www.w3resource.com/coffeescript-exercises/ [41] => https://www.w3resource.com/php-exercises/ [42] => https://www.w3resource.com/python-exercises/ [43] => https://www.w3resource.com/c-programming-exercises/ [44] => https://www.w3resource.com/csharp-exercises/ [45] => https://www.w3resource.com/java-exercises/ [46] => https://www.w3resource.com/sql-exercises/ [47] => https://www.w3resource.com/oracle-exercises/ [48] => https://www.w3resource.com/mysql-exercises/ [49] => https://www.w3resource.com/sqlite-exercises/ [50] => https://www.w3resource.com/postgresql-exercises/ [51] => https://www.w3resource.com/mongodb-exercises/ [52] => https://www.w3resource.com/twitter-bootstrap/examples.php [53] => https://www.w3resource.com/euler-project/ [54] => https://w3resource.com/w3skills/html5-quiz/ [55] => https://w3resource.com/w3skills/php-fundamentals/ [56] => https://w3resource.com/w3skills/sql-beginner/ [57] => https://w3resource.com/w3skills/python-beginner-quiz/ [58] => https://w3resource.com/w3skills/mysql-basic-quiz/ [59] => https://w3resource.com/w3skills/javascript-basic-skill-test/ [60] => https://w3resource.com/w3skills/javascript-advanced-quiz/ [61] => https://w3resource.com/w3skills/javascript-quiz-part-iii/ [62] => https://w3resource.com/w3skills/mongodb-basic-quiz/ [63] => https://www.w3resource.com/form-template/ [64] => https://www.w3resource.com/slides/ [65] => https://www.w3resource.com/convert/number/binary-to-decimal.php [66] => https://www.w3resource.com/excel/ [67] => https://www.w3resource.com/video-tutorial/php/some-basics-of-php.php [68] => https://www.w3resource.com/video-tutorial/javascript/list-of-tutorial.php [69] => https://www.w3resource.com/web-development-tools/firebug-tutorials.php [70] => https://www.w3resource.com/web-development-tools/useful-web-development-tools.php [71] => https://www.facebook.com/w3resource [72] => https://twitter.com/w3resource [73] => https://plus.google.com/+W3resource [74] => https://in.linkedin.com/in/w3resource [75] => https://feeds.feedburner.com/W3resource [76] => https://www.w3resource.com/ruby-exercises/ [77] => https://www.w3resource.com/graphics/matplotlib/ [78] => https://www.w3resource.com/python-exercises/numpy/index.php [79] => https://www.w3resource.com/python-exercises/pandas/index.php [80] => https://w3resource.com/plsql-exercises/ [81] => https://w3resource.com/swift-programming-exercises/ [82] => https://www.w3resource.com/angular/getting-started-with-angular.php [83] => https://www.w3resource.com/react/react-js-overview.php [84] => https://www.w3resource.com/vue/installation.php [85] => https://www.w3resource.com/jest/jest-getting-started.php [86] => https://www.w3resource.com/numpy/ [87] => https://www.w3resource.com/php/composer/a-gentle-introduction-to-composer.php [88] => https://www.w3resource.com/php/PHPUnit/a-gentle-introduction-to-unit-test-and-testing.php [89] => https://www.w3resource.com/laravel/laravel-tutorial.php [90] => https://www.w3resource.com/oracle/index.php [91] => https://www.w3resource.com/redis/index.php [92] => https://www.w3resource.com/cpp-exercises/ [93] => https://www.w3resource.com/r-programming-exercises/ [94] => https://w3resource.com/w3skills/ [95] => https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US [96] => https://www.w3resource.com/privacy.php [97] => https://www.w3resource.com/about.php [98] => https://www.w3resource.com/contact.php [99] => https://www.w3resource.com/feedback.php [100] => https://www.w3resource.com/advertise.php )

希望,这将对某人有所帮助。这是要点- https://gist.github.com/ManiruzzamanAkash/74cffb9ffdfc92f57bd9cf214cf13491