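I want to scrape this page: http://www.asx.com.au/asx/statistics/todayAnns.do
My code does not seem to retrieve the page's full HTML, and it behaves very strangely.
I have also tried Simple HTML DOM, but that did not work either.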
$base = "http://www.asx.com.au/asx/statistics/todayAnns.do";
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
echo htmlspecialchars($str);
The output is mostly JavaScript, and I cannot get the actual page content. My goal is to scrape the table in the middle of that page.
Answer 0 (score: 1)
If you do not need up-to-the-minute data, you can use Google's cached copy of the page.
<?php
use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler(
    'http://webcache.googleusercontent.com/search?q=cache:http://www.asx.com.au/asx/statistics/todayAnns.do&num=1&strip=0&vwsrc=0'
);

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//div[@class="page"]//table');
$configuration->setRowXPath('.//tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Headline',
                'xpath' => './/td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Published',
                'xpath' => './/td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Pages',
                'xpath' => './/td[4]',
            ]
        ),
        new \Scraper\Structure\AnchorField(
            [
                'name'               => 'Link',
                'xpath'              => './/td[5]/a',
                'convertRelativeUrl' => false,
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Code',
                'xpath' => './/text()',
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);
With the code above I was able to get the following data.
Array
(
    [0] => Array
        (
            [Code] => ASX
            [hash] => 6e16c02b10a10baf739c2613bc87f906
        )
    [1] => Array
        (
            [Headline] => Initial Director's Interest Notice
            [Published] => 10:57 AM
            [Pages] => 1
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868833
            [Code] => STO
            [hash] => aa2ea9b1b9b0bc843a4ac41e647319b4
        )
    [2] => Array
        (
            [Headline] => Becoming a substantial holder
            [Published] => 10:53 AM
            [Pages] => 2
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868832
            [Code] => AKG
            [hash] => f8ff8dfde597a0fc68284b8957f38758
        )
    [3] => Array
        (
            [Headline] => LBT Investor Conference Call Business Update
            [Published] => 10:53 AM
            [Pages] => 9
            [Link] => /asx/statistics/displayAnnouncement.do?display=pdf&idsId=01868831
            [Code] => LBT
            [hash] => cc78f327f2b421f46036de0fce270a6d
        )
...
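Disclaimer: I used the https://github.com/rajanrx/php-scrape framework, and I am the author of that library. You can also fetch the data with plain cURL and the XPaths listed above. I hope this helps :)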
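For the plain cURL route, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath with the same XPaths as the configuration above. The function names `fetchHtml` and `parseAnnouncements` are made up for illustration, and `$sampleHtml` is a hypothetical stand-in for the real page: its column layout is only assumed from the XPaths, and the header cells and empty second column are invented.

```php
<?php
// Fetch a URL with plain cURL; returns the body as a string, or false on failure.
function fetchHtml(string $url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    return curl_exec($curl);
}

// Pull announcement rows out of an HTML string using the XPaths from the answer.
function parseAnnouncements(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Helper: trimmed text of the first node matching $expr under $context.
    $text = function (string $expr, DOMNode $context) use ($xpath): ?string {
        $node = $xpath->query($expr, $context)->item(0);
        return $node ? trim($node->textContent) : null;
    };

    $rows = [];
    foreach ($xpath->query('//div[@class="page"]//table//tr') as $tr) {
        $headline = $text('.//td[3]', $tr);
        if ($headline === null) {
            continue; // skip rows without a third cell, e.g. header rows
        }
        $anchor = $xpath->query('.//td[5]/a', $tr)->item(0);
        $rows[] = [
            'Headline'  => $headline,
            'Published' => $text('.//td[1]', $tr),
            'Pages'     => $text('.//td[4]', $tr),
            'Link'      => $anchor instanceof DOMElement ? $anchor->getAttribute('href') : null,
        ];
    }
    return $rows;
}

// Hypothetical sample markup (assumed layout, not copied from the real page).
$sampleHtml = <<<'HTML'
<div class="page"><table>
<tr><th>Time</th><th></th><th>Headline</th><th>Pages</th><th>PDF</th></tr>
<tr><td>10:57 AM</td><td></td><td>Initial Director's Interest Notice</td><td>1</td>
<td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&amp;idsId=01868833">PDF</a></td></tr>
</table></div>
HTML;

$rows = parseAnnouncements($sampleHtml);
print_r($rows);
```

Against the live site you would pass the result of `fetchHtml()` for the Google cache URL to `parseAnnouncements()` instead of the sample string, after checking that it is not `false`.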
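Answer 1 (score: 0)
cURL can only load the page's markup. The page above loads its data with JavaScript after the page itself has loaded, so you will probably have to use PhantomJS or Splash.
This link may help: https://stackoverflow.com/a/20554152/3086531
To fetch the data on the server side, we can drive PhantomJS from PHP: execute the page in PhantomJS, then use the exec command to dump the data into PHP.
This article walks through the process step by step: http://shout.setfive.com/2015/03/30/7817/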
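The PHP side of that approach might look roughly like the sketch below. It assumes a `phantomjs` binary on the PATH and a hypothetical script `dump_page.js` (not shown here) that opens the URL given as its first argument and prints the rendered page to standard output. The function names are made up for illustration.

```php
<?php
// Build the shell command that runs a PhantomJS script against a URL.
// Script path and URL are escaped so they cannot inject shell syntax.
function buildPhantomCommand(string $scriptPath, string $url, string $binary = 'phantomjs'): string
{
    return escapeshellcmd($binary) . ' ' . escapeshellarg($scriptPath) . ' ' . escapeshellarg($url);
}

// Run PhantomJS and return whatever the script printed, or null on failure.
function renderWithPhantom(string $scriptPath, string $url): ?string
{
    $outputLines = [];
    $exitCode = 0;
    exec(buildPhantomCommand($scriptPath, $url), $outputLines, $exitCode);
    if ($exitCode !== 0) {
        return null; // PhantomJS is missing or the script reported an error
    }
    return implode("\n", $outputLines);
}

// 'dump_page.js' is a hypothetical script name used only for this example.
$command = buildPhantomCommand('dump_page.js', 'http://www.asx.com.au/asx/statistics/todayAnns.do');
echo $command, PHP_EOL;
```

Calling `renderWithPhantom()` with the same arguments would return the rendered HTML, which can then be parsed with XPath like any other markup.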