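I want to add a new feature to my website.
I would like users to be able to upload their bookmark backup file (from any browser, if possible) so that I can import it into their profile and they don't have to enter all of their bookmarks manually.
The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue about where to start or what to read?
I used the search option, and "how to extract data from a raw html file" was the most relevant question I found, but it doesn't cover this.
I really don't mind whether the solution uses jQuery or PHP.
Thank you very much.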
Answer 0 (score: 53)
Thank you everyone, I got it!
Final code: this shows the anchor text and the href of every link in the .html file.
$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their text and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}
Again, thanks a lot.
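For the upload feature described in the question, it may be more convenient to return the bookmarks as an array than to echo them. The following is only a sketch under a few assumptions: the function name `extractBookmarks` and the sample markup are made up for illustration, the sample imitates the Netscape bookmark layout that browsers export, and `libxml_use_internal_errors()` is swapped in for the `@` operator to keep parser warnings quiet.

```php
<?php
/**
 * Hypothetical helper (not part of the answer above): returns the bookmarks
 * found in an exported bookmarks file as a list of title/url pairs.
 */
function extractBookmarks(string $html): array
{
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = trim($link->getAttribute('href'));
        if ($url === '') {
            continue; // anchors without an href are not bookmarks
        }
        $bookmarks[] = [
            'title' => trim($link->textContent),
            'url'   => $url,
        ];
    }
    return $bookmarks;
}

// Made-up sample imitating a browser's exported bookmarks file.
$sample = <<<HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><A HREF="https://www.php.net/" ADD_DATE="1500000000">PHP</A>
    <DT><A HREF="https://example.com/docs">Example docs</A>
</DL><p>
HTML;

print_r(extractBookmarks($sample));
```

Each entry can then be inserted into the user's profile one row at a time. In a real upload handler the HTML string would come from the uploaded file rather than from a literal.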
Answer 1 (score: 33)
This is probably sufficient:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
    echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}
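If only anchors that actually carry an `href` are wanted, an XPath query is one way to narrow the selection. This is a sketch, not part of the answer: the function name `linksByHref` and the sample markup are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: maps each href to its link text using an XPath query.
 */
function linksByHref(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // '//a[@href]' selects only <a> elements that have an href attribute.
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[$node->getAttribute('href')] = trim($node->textContent);
    }
    return $links;
}

// Made-up sample: the second anchor has no href and is skipped.
$html = '<p><a href="https://example.com/">Example</a> <a name="top">no href</a></p>';
foreach (linksByHref($html) as $href => $text) {
    echo $text, ': ', $href, "\n";
}
```

Keying the result by href also removes duplicate links, at the cost of keeping only the last text seen for each URL.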
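Answer 2 (score: 5)
Assuming the stored links are in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it myself). (The other option is a basic string search or a regexp, and you should probably never use a regexp to parse HTML.)
After reading the HTML file with the parser, use its functions to find the a tags.
From the tutorial: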
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Answer 3 (score: 3)
Here is an example. In your specific case you can load the file with:
$content = file_get_contents('bookmarks.html');
Run:
<?php
$content = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>';
$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
$matches = array(); //create array
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));
Output:
Array
(
[0] => http://clicklink.com
[1] => http://foobar.com
)
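When only `http` and `https` URLs matter, a much shorter pattern may be enough. The sketch below is a simplified stand-in for the pattern above, not an equivalent: the function name `findHttpUrls` is hypothetical, and the pattern simply takes everything after the scheme up to the next whitespace, quote, or angle bracket.

```php
<?php
/**
 * Hypothetical helper: returns the unique http(s) URLs found anywhere in a string.
 */
function findHttpUrls(string $content): array
{
    // Match "http://" or "https://" followed by anything that is not
    // whitespace, a quote, or an angle bracket.
    preg_match_all('~https?://[^\s"\'<>]+~i', $content, $matches);
    return array_values(array_unique($matches[0]));
}

// Made-up sample with one linked URL, one bare URL, and one repeat.
$content = 'Click <a href="http://clicklink.com">here</a> for foobar. '
         . 'Another site is http://foobar.com and again http://clicklink.com';

print_r(findHttpUrls($content));
```

A pattern this loose also swallows trailing punctuation such as a full stop directly after a URL, and it loses the link titles, so for bookmark files a DOM parser remains the safer choice.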
Answer 4 (score: 1)
$html = file_get_contents('your file path');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$styles = $dom->getElementsByTagName('link');
$links = $dom->getElementsByTagName('a');
$scripts = $dom->getElementsByTagName('script');
foreach($styles as $style)
{
    if($style->getAttribute('href')!="#")
    {
        echo $style->getAttribute('href');
        echo'<br>';
    }
}
foreach ($links as $link){
    if($link->getAttribute('href')!="#")
    {
        echo $link->getAttribute('href');
        echo'<br>';
    }
}
foreach($scripts as $script)
{
    echo $script->getAttribute('src');
    echo'<br>';
}
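The three loops above differ only in the tag name and the attribute they read, so they can be driven by a small lookup table. This is a sketch under assumptions: the function name `collectResourceUrls` and the sample page are invented for illustration, and empty values are skipped in addition to `#` (inline `<script>` blocks have no `src`).

```php
<?php
/**
 * Hypothetical helper: groups the URLs referenced by a page by element type.
 */
function collectResourceUrls(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Element name => attribute that holds its URL.
    $sources = ['link' => 'href', 'a' => 'href', 'script' => 'src'];

    $found = [];
    foreach ($sources as $tag => $attribute) {
        $found[$tag] = [];
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $url = trim($node->getAttribute($attribute));
            // Skip "#" placeholders and elements without the attribute.
            if ($url !== '' && $url !== '#') {
                $found[$tag][] = $url;
            }
        }
    }
    return $found;
}

// Made-up sample page.
$page = '<html><head><link rel="stylesheet" href="/css/site.css">'
      . '<script src="/js/app.js"></script><script>var x = 1;</script></head>'
      . '<body><a href="#">top</a><a href="/about">About</a></body></html>';

print_r(collectResourceUrls($page));
```

Supporting another element, for example `img` with `src`, would then only need one more entry in `$sources`.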
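Answer 5 (score: 0)
I wanted to create a CSV of link paths and their text from HTML pages so that I could rip menus and the like from a site.
In this example you specify the domain you are interested in, so that off-site links are skipped, and a CSV is then generated for each document.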
/**
 * Extracts links to the given domain from the files and creates CSVs of the links
 */
$LinkExtractor = new LinkExtractor('https://www.example.co.uk');
$LinkExtractor->extract(__DIR__ . '/hamburger.htm');
$LinkExtractor->extract(__DIR__ . '/navbar.htm');
$LinkExtractor->extract(__DIR__ . '/footer.htm');
class LinkExtractor {
    public $domain;
    public function __construct($domain) {
        $this->domain = $domain;
    }
    public function extract($file) {
        $html = file_get_contents($file);
        //Create a new DOM document
        $dom = new DOMDocument;
        //Parse the HTML. The @ is used to suppress any parsing errors
        //that will be thrown if the $html string isn't valid XHTML.
        @$dom->loadHTML($html);
        //Get all links. You could also use any other tag name here,
        //like 'img' or 'table', to extract other tags.
        $links = $dom->getElementsByTagName('a');
        $results = [];
        //Iterate over the extracted links and collect the matching ones
        foreach ($links as $link){
            //Extract and put the matching links in an array for the CSV
            $href = $link->getAttribute('href');
            $parts = parse_url($href);
            //Relative links have no host, so check that it exists before comparing
            if (!empty($parts['path']) && !empty($parts['host']) && strpos($this->domain, $parts['host']) !== false) {
                $results[$parts['path']] = [$parts['path'], $link->nodeValue];
            }
        }
        asort($results);
        // Make the CSV
        $fp = fopen($file .'.csv', 'w');
        foreach ($results as $fields) {
            fputcsv($fp, $fields);
        }
        fclose($fp);
    }
}
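The `strpos()` test in the class accepts any host that appears as a substring of the configured domain, so a link to `example.co` would pass for `https://www.example.co.uk`. If a stricter check is wanted, the hosts can be compared directly. The following is a sketch only: `isSameHost` is a hypothetical helper, not part of the class above, and it deliberately rejects relative links because they carry no host.

```php
<?php
/**
 * Hypothetical helper: true when $href is an absolute URL on the same host as $domain.
 */
function isSameHost(string $href, string $domain): bool
{
    $linkHost = parse_url($href, PHP_URL_HOST);
    $siteHost = parse_url($domain, PHP_URL_HOST);

    // parse_url() gives null when there is no host (relative links) and
    // false when the URL is seriously malformed.
    if (!is_string($linkHost) || !is_string($siteHost)) {
        return false;
    }
    // Host names are case-insensitive.
    return strcasecmp($linkHost, $siteHost) === 0;
}

$domain = 'https://www.example.co.uk';
var_dump(isSameHost('https://www.example.co.uk/menu/lunch', $domain)); // same host
var_dump(isSameHost('https://example.co/menu', $domain));              // different host
var_dump(isSameHost('/menu/dinner', $domain));                         // relative, no host
```

Whether relative links should count as on-site depends on the pages being processed. If they should, that case would need to be handled before calling a check like this one.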
Answer 6 (score: -1)
This is something I did for a client; it works as a function that can be used anywhere.
function getValidUrlsFrompage($source)
{
    $links = [];
    $content = file_get_contents($source);
    $content = strip_tags($content, "<a>");
    $subString = preg_split("/<\/a>/", $content);
    foreach ($subString as $val) {
        if (strpos($val, "<a href=") !== FALSE) {
            $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
            $val = preg_replace("/\".*/", "", $val);
            $val = trim($val);
        }
        if (strlen($val) > 0 && filter_var($val, FILTER_VALIDATE_URL)) {
            if (!in_array($val, $links)) {
                $links[] = $val;
            }
        }
    }
    return $links;
}
And use it like this:
$links = getValidUrlsFrompage("https://www.w3resource.com/");
The expected output is an array of URLs (the listing below has 101 entries, indexed 0 to 100):
Array ( [0] => https://www.w3resource.com [1] => https://www.w3resource.com/html/HTML-tutorials.php [2] => https://www.w3resource.com/css/CSS-tutorials.php [3] => https://www.w3resource.com/javascript/javascript.php [4] => https://www.w3resource.com/html5/introduction.php [5] => https://www.w3resource.com/schema.org/introduction.php [6] => https://www.w3resource.com/phpjs/use-php-functions-in-javascript.php [7] => https://www.w3resource.com/twitter-bootstrap/tutorial.php [8] => https://www.w3resource.com/responsive-web-design/overview.php [9] => https://www.w3resource.com/zurb-foundation3/introduction.php [10] => https://www.w3resource.com/pure/ [11] => https://www.w3resource.com/html5-canvas/ [12] => https://www.w3resource.com/course/javascript-course.html [13] => https://www.w3resource.com/icon/ [14] => https://www.w3resource.com/linux-system-administration/installation.php [15] => https://www.w3resource.com/linux-system-administration/linux-commands-introduction.php [16] => https://www.w3resource.com/php/php-home.php [17] => https://www.w3resource.com/python/python-tutorial.php [18] => https://www.w3resource.com/java-tutorial/ [19] => https://www.w3resource.com/node.js/node.js-tutorials.php [20] => https://www.w3resource.com/ruby/ [21] => https://www.w3resource.com/c-programming/programming-in-c.php [22] => https://www.w3resource.com/sql/tutorials.php [23] => https://www.w3resource.com/mysql/mysql-tutorials.php [24] => https://w3resource.com/PostgreSQL/tutorial.php [25] => https://www.w3resource.com/sqlite/ [26] => https://www.w3resource.com/mongodb/nosql.php [27] => https://www.w3resource.com/API/google-plus/tutorial.php [28] => https://www.w3resource.com/API/youtube/tutorial.php [29] => https://www.w3resource.com/API/google-maps/index.php [30] => https://www.w3resource.com/API/flickr/tutorial.php [31] => https://www.w3resource.com/API/last.fm/tutorial.php [32] => https://www.w3resource.com/API/twitter-rest-api/ [33] => https://www.w3resource.com/xml/xml.php 
[34] => https://www.w3resource.com/JSON/introduction.php [35] => https://www.w3resource.com/ajax/introduction.php [36] => https://www.w3resource.com/html-css-exercise/index.php [37] => https://www.w3resource.com/javascript-exercises/ [38] => https://www.w3resource.com/jquery-exercises/ [39] => https://www.w3resource.com/jquery-ui-exercises/ [40] => https://www.w3resource.com/coffeescript-exercises/ [41] => https://www.w3resource.com/php-exercises/ [42] => https://www.w3resource.com/python-exercises/ [43] => https://www.w3resource.com/c-programming-exercises/ [44] => https://www.w3resource.com/csharp-exercises/ [45] => https://www.w3resource.com/java-exercises/ [46] => https://www.w3resource.com/sql-exercises/ [47] => https://www.w3resource.com/oracle-exercises/ [48] => https://www.w3resource.com/mysql-exercises/ [49] => https://www.w3resource.com/sqlite-exercises/ [50] => https://www.w3resource.com/postgresql-exercises/ [51] => https://www.w3resource.com/mongodb-exercises/ [52] => https://www.w3resource.com/twitter-bootstrap/examples.php [53] => https://www.w3resource.com/euler-project/ [54] => https://w3resource.com/w3skills/html5-quiz/ [55] => https://w3resource.com/w3skills/php-fundamentals/ [56] => https://w3resource.com/w3skills/sql-beginner/ [57] => https://w3resource.com/w3skills/python-beginner-quiz/ [58] => https://w3resource.com/w3skills/mysql-basic-quiz/ [59] => https://w3resource.com/w3skills/javascript-basic-skill-test/ [60] => https://w3resource.com/w3skills/javascript-advanced-quiz/ [61] => https://w3resource.com/w3skills/javascript-quiz-part-iii/ [62] => https://w3resource.com/w3skills/mongodb-basic-quiz/ [63] => https://www.w3resource.com/form-template/ [64] => https://www.w3resource.com/slides/ [65] => https://www.w3resource.com/convert/number/binary-to-decimal.php [66] => https://www.w3resource.com/excel/ [67] => https://www.w3resource.com/video-tutorial/php/some-basics-of-php.php [68] => 
https://www.w3resource.com/video-tutorial/javascript/list-of-tutorial.php [69] => https://www.w3resource.com/web-development-tools/firebug-tutorials.php [70] => https://www.w3resource.com/web-development-tools/useful-web-development-tools.php [71] => https://www.facebook.com/w3resource [72] => https://twitter.com/w3resource [73] => https://plus.google.com/+W3resource [74] => https://in.linkedin.com/in/w3resource [75] => https://feeds.feedburner.com/W3resource [76] => https://www.w3resource.com/ruby-exercises/ [77] => https://www.w3resource.com/graphics/matplotlib/ [78] => https://www.w3resource.com/python-exercises/numpy/index.php [79] => https://www.w3resource.com/python-exercises/pandas/index.php [80] => https://w3resource.com/plsql-exercises/ [81] => https://w3resource.com/swift-programming-exercises/ [82] => https://www.w3resource.com/angular/getting-started-with-angular.php [83] => https://www.w3resource.com/react/react-js-overview.php [84] => https://www.w3resource.com/vue/installation.php [85] => https://www.w3resource.com/jest/jest-getting-started.php [86] => https://www.w3resource.com/numpy/ [87] => https://www.w3resource.com/php/composer/a-gentle-introduction-to-composer.php [88] => https://www.w3resource.com/php/PHPUnit/a-gentle-introduction-to-unit-test-and-testing.php [89] => https://www.w3resource.com/laravel/laravel-tutorial.php [90] => https://www.w3resource.com/oracle/index.php [91] => https://www.w3resource.com/redis/index.php [92] => https://www.w3resource.com/cpp-exercises/ [93] => https://www.w3resource.com/r-programming-exercises/ [94] => https://w3resource.com/w3skills/ [95] => https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US [96] => https://www.w3resource.com/privacy.php [97] => https://www.w3resource.com/about.php [98] => https://www.w3resource.com/contact.php [99] => https://www.w3resource.com/feedback.php [100] => https://www.w3resource.com/advertise.php )
Hope this helps someone. Here is the gist: https://gist.github.com/ManiruzzamanAkash/74cffb9ffdfc92f57bd9cf214cf13491
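The string-splitting approach above only recognises anchors written exactly as `<a href="...">`, so links with another attribute before `href`, or with single-quoted values, are skipped. A DOM-based variant of the same idea (validate with `FILTER_VALIDATE_URL`, then de-duplicate) might look like the sketch below. The function name `validUrlsFromHtml` and the sample markup are assumptions for illustration, and the function takes an HTML string rather than fetching a URL.

```php
<?php
/**
 * Hypothetical helper: returns the unique, valid absolute URLs linked from $html.
 */
function validUrlsFromHtml(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = trim($link->getAttribute('href'));
        // FILTER_VALIDATE_URL rejects empty strings and relative paths.
        if (filter_var($href, FILTER_VALIDATE_URL) !== false) {
            $urls[$href] = true; // array keys de-duplicate the list
        }
    }
    return array_keys($urls);
}

// Made-up sample: attribute order and quoting vary, one link is relative,
// and one URL appears twice.
$html = '<a class="nav" href="https://example.com/a">A</a>'
      . '<a href=\'https://example.com/b\'>B</a>'
      . '<a href="/relative">C</a>'
      . '<a href="https://example.com/a">A again</a>';

print_r(validUrlsFromHtml($html));
```

To use it on a live page, the string could be obtained with `file_get_contents()` as in the function above.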