Simple HTML DOM: extracting links without the domain name

Date: 2019-02-09 02:35:53

Tags: php web-scraping

How can I get the links without the hdo.to domain name, using Simple HTML DOM?

<?php
include('simple_html_dom.php');

$url = 'https://hdo.to/';

$html = file_get_html($url);
foreach($html->find('a[href^="https://hdo.to/country/"]') as $klk) {
 echo $links[] = $klk;
}

?>

This is what I am getting:
<a href="https://hdo.to/country/asia">Asia</a><a href="https://hdo.to/country/china">China</a><a href="https://hdo.to/country/euro">Euro</a><a href="https://hdo.to/country/france">France</a><a href="https://hdo.to/country/hongkong">HongKong</a><a href="https://hdo.to/country/India">India</a><a href="https://hdo.to/country/international">International</a><a href="https://hdo.to/country/japan">Japan</a><a href="https://hdo.to/country/korea">Korea</a><a href="https://hdo.to/country/taiwan">Taiwan</a><a href="https://hdo.to/country/thailand">Thailand</a><a href="https://hdo.to/country/united-kingdom">United Kingdom</a><a href="https://hdo.to/country/united-states">United States</a><a href=https://hdo.to/country/united-states title=United states>United States</a><a href=https://hdo.to/country/united-states title=United states>United States</a><a href=https://hdo.to/country/united-states title=United states>United States</a><a href=https://hdo.to/country/united-states title=United states>United States</a><a href=https://hdo.to/country/united-states title=United states>United States</a><a href=https://hdo.to/country/united-kingdom title=United kingdom>United Kingdom</a>

I want to get links like /country/china, without the domain name.

1 answer:

Answer 0 (score: 0)

Try this:

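<?php
include('simple_html_dom.php');

$url = 'https://hdo.to/';

$html = file_get_html($url);
$links = array();

// Select only anchors whose href starts with the country prefix.
foreach($html->find('a[href^="https://hdo.to/country/"]') as $klk) {
  // parse_url() splits the URL into its components; keep only the path.
  $link = $klk->href;
  $link_parts = parse_url( $link );
  $links[] = $link_parts['path'];
}

print_r( $links );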

Result:

Array
(
    [0] => /country/asia
    [1] => /country/china
    [2] => /country/euro
    [3] => /country/france
    [4] => /country/hongkong
    [5] => /country/India
    [6] => /country/international
    [7] => /country/japan
    [8] => /country/korea
    [9] => /country/taiwan
    [10] => /country/thailand
    [11] => /country/united-kingdom
    [12] => /country/united-states
    [13] => /country/united-states
    [14] => /country/united-states
    [15] => /country/united-states
    [16] => /country/united-states
    [17] => /country/united-states
    [18] => /country/united-kingdom
)
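
To try the parse_url() step without fetching the page, here is a minimal sketch that runs on hard-coded URLs. The helper name pathOnly and the sample list are made up for illustration; only parse_url() and its PHP_URL_PATH flag come from PHP itself.

```php
<?php
// Hypothetical helper: return only the path component of an absolute URL.
function pathOnly(string $url): string
{
    // With PHP_URL_PATH, parse_url() returns the path as a string,
    // null when the URL has no path, or false when the URL is malformed.
    $path = parse_url($url, PHP_URL_PATH);
    return is_string($path) ? $path : '';
}

// Sample URLs (assumed for the demo, taken from the output in the question).
$samples = [
    'https://hdo.to/country/asia',
    'https://hdo.to/country/united-kingdom',
    'https://hdo.to', // no path component at all
];

$paths = array_map('pathOnly', $samples);
print_r($paths);
```

The first two entries come back as /country/asia and /country/united-kingdom. The last one has no path, so the helper falls back to an empty string instead of raising an "undefined index" notice the way $link_parts['path'] would.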

Edit:

If you only want to strip the domain name from the output:

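<?php
include('simple_html_dom.php');

$url = 'https://hdo.to/';

$html = file_get_html($url);
$links = array();

foreach($html->find('a[href^="https://hdo.to/country/"]') as $klk) {
  // Casting the node to a string yields its outer HTML; strip the domain from it.
  $link = str_replace( 'https://hdo.to', '', (string) $klk );
  $links[] = $link;
  echo $link, "\n"; // print one anchor per line
}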

Result:

<a href="/country/asia">Asia</a>
<a href="/country/china">China</a>
<a href="/country/euro">Euro</a>
<a href="/country/france">France</a>
<a href="/country/hongkong">HongKong</a>
<a href="/country/India">India</a>
<a href="/country/international">International</a>
<a href="/country/japan">Japan</a>
<a href="/country/korea">Korea</a>
<a href="/country/taiwan">Taiwan</a>
<a href="/country/thailand">Thailand</a>
<a href="/country/united-kingdom">United Kingdom</a>
<a href="/country/united-states">United States</a>
<a href=/country/united-states title=United states>United States</a>
<a href=/country/united-states title=United states>United States</a>
<a href=/country/united-states title=United states>United States</a>
<a href=/country/united-states title=United states>United States</a>
<a href=/country/united-states title=United states>United States</a>
<a href=/country/united-kingdom title=United kingdom>United Kingdom</a>
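
One caveat with str_replace(): it removes every occurrence of the domain, wherever it appears in the string, not just at the start of the href. If that matters, a prefix-only variant is a possible alternative. The sketch below is an assumption on my part, not part of Simple HTML DOM; stripOrigin is a made-up helper name and the second test URL is invented for the demo.

```php
<?php
// Hypothetical helper: remove $origin only when $href starts with it.
function stripOrigin(string $href, string $origin): string
{
    // Require a slash right after the origin so that a host such as
    // "hdo.to.example" is not treated as a match.
    $prefix = rtrim($origin, '/') . '/';
    if (strncmp($href, $prefix, strlen($prefix)) === 0) {
        // Cut just before the slash so the path keeps its leading "/".
        return substr($href, strlen($prefix) - 1);
    }
    return $href; // different site or already relative: leave untouched
}

$origin = 'https://hdo.to';
$a = stripOrigin('https://hdo.to/country/china', $origin);
$b = stripOrigin('https://example.com/?ref=https://hdo.to/x', $origin);
echo $a, "\n", $b, "\n";
```

The first call gives /country/china. The second URL is returned unchanged, whereas str_replace() would have removed the domain from inside the query string as well.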