Get link titles from a website

Date: 2014-09-01 23:00:00

Tags: php html web-scraping domdocument

Hello, is it possible to export all the addresses from this page to a txt file?

http://bitinfocharts.com/top-100-richest-bitcoin-addresses.html

For example:

1BPqtqBKoUjEq8STWmJxhPqtsf3BKp5UyE
1i7cZdoE9NcHSdAL5eGjmTJbBVqeQDwgw
etc...

I wrote this code:

<?php
$html = file_get_contents('http://bitinfocharts.com/top-100-richest-bitcoin-addresses-5.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their URLs
foreach ($links as $link){
    //Extract and show the "href" attribute. 
    echo $link->getAttribute('href'), '<br>';
}
?>

But it prints all the links on the page, and I only need the addresses...

2 Answers:

Answer 0: (score: 1)

This can be done simply with text manipulation:

// get page
$html = file_get_contents('http://bitinfocharts.com/top-100-richest-bitcoin-addresses.html');
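// split on the fragment that sits just in front of each address
$parts = explode('./bitcoin/address/', $html);
// drop the first part (everything before the first address)
array_shift($parts);
// get addresses from all subsequent parts
$addresses = array();
foreach ($parts as $part) {
    // the address runs up to the closing quote of the href attribute
    $addresses[] = substr($part, 0, strpos($part, '"'));
}
// show result
echo implode('<br>', $addresses);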

The comments explain the code. I admit that working with the DOM has its own elegance.
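
For comparison, the same filtering idea can be sketched with DOMDocument, closer to the code in the question. This is only a minimal sketch: the helper name extract_addresses, its $marker parameter, and the sample markup are made up for illustration, and the marker "bitcoin/address/" is assumed from the explode() call above rather than checked against the live page.

```php
<?php
// Hypothetical helper: return the address part of every link whose href
// contains $marker. The default marker is an assumption about the page.
function extract_addresses($html, $marker = 'bitcoin/address/')
{
    $addresses = array();
    if ($html === '') {
        return $addresses; // loadHTML() does not accept an empty string
    }

    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $pos = strpos($href, $marker);
        if ($pos === false) {
            continue; // not an address link
        }
        // Everything after the marker is treated as the address.
        $address = (string) substr($href, $pos + strlen($marker));
        if ($address !== '') {
            $addresses[] = $address;
        }
    }

    // The same address may be linked more than once, so drop repeats.
    return array_values(array_unique($addresses));
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<table>'
    . '<tr><td><a href="./bitcoin/address/1BPqtqBKoUjEq8STWmJxhPqtsf3BKp5UyE">first</a></td></tr>'
    . '<tr><td><a href="./bitcoin/address/1i7cZdoE9NcHSdAL5eGjmTJbBVqeQDwgw">second</a></td></tr>'
    . '<tr><td><a href="./bitcoin/address/1BPqtqBKoUjEq8STWmJxhPqtsf3BKp5UyE">repeat</a></td></tr>'
    . '<tr><td><a href="./about.html">About</a></td></tr>'
    . '</table>';

$found = extract_addresses($sample);
echo implode("\n", $found), "\n";
```

With the real page, $sample would be replaced by the string returned from file_get_contents(). Reading the href instead of the link text avoids depending on how the page abbreviates or decorates the visible label.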

Answer 1: (score: 1)

What I would do is target each table row, then target the anchor link inside it. For example:

$html = file_get_contents('http://bitinfocharts.com/top-100-richest-bitcoin-addresses-5.html');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$data = array();
$table_rows = $xpath->query('//h1[contains(text(), "Top 100 Richest Addresses Bitcoin")]/following-sibling::div[2]/table/tr');
foreach($table_rows as $row) {
    $cell = $xpath->query('./td[2]/a', $row);
    if ($cell->length > 0) {
        $data[] = $cell->item(0)->nodeValue;
    }
}

echo '<pre>';
print_r($data);

//file_put_contents('your_file.txt', implode("\n", $data));
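
The commented-out file_put_contents() line is the part that covers the "export to a txt file" request. A minimal sketch of that step follows, assuming $data already holds the addresses. The two sample entries and the output path under the system temp directory are placeholders chosen so the sketch runs on its own.

```php
<?php
// Stand-in for the $data array built by the XPath loop.
$data = array(
    '1KcRjW2roV8dtZoBMPD83nsFburPCY7RfR',
    '1LovisaJ31py5rr37y5xpt3MzSjErpoeLr',
);

// Hypothetical output path; any writable location would do.
$file = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'addresses.txt';

// Write one address per line. LOCK_EX takes an exclusive lock while writing.
$written = file_put_contents($file, implode("\n", $data) . "\n", LOCK_EX);
if ($written === false) {
    throw new RuntimeException("Could not write $file");
}

echo 'Wrote ', count($data), " addresses to $file\n";
```

file_put_contents() returns the number of bytes written, or false on failure, so the strict comparison with false is what separates a failed write from a successful one.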

$data looks like this (partial output):

Array
(
    [0] => 1KcRjW2roV8dtZoBMPD83nsFburPCY7RfR
    [1] => 1LovisaJ31py5rr37y5xpt3MzSjErpoeLr
    [2] => 1BE1ttHnrJ7YKkLgKpiNrp8uT3kM6Y1xfg
    [3] => 1Czx5RKaDkiE56RwdeLXRYL57ZxxdFxwhb
    [4] => 1BhQDdQgVyAekFZjT1nW8PB5XRt9VJhRs5
    [5] => 1JsSF3YLF4v9Fasfu6pqevwWc5Mtyf76M3