I want to retrieve some data items from a web page.
Page URL:
http://www.walmart.com/storeLocator/ca_storefinder_results.do
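I want to retrieve the data items shown on that page.
I have tried many approaches without success, because the tags have no id or specific class assigned and there is no tag hierarchy I can use to get the data under each heading.
If you look at the HTML source of the page above, the data items are already available as variables inside a <script> tag. Can anyone tell me how to retrieve these data items for each store?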
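Answer 0 (score: 4)
The data items you want are the elements of the stores array.
If you have time to learn a new tool, I suggest you take a look at PhantomJS.
It is a headless browser: it behaves like a regular browser and executes the page's JavaScript.
If you open the browser console on that page, you will see that stores is a global variable, so reading the contents of this array with PhantomJS is easy.
Example PhantomJS script: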
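var url = "your url";
var page = require('webpage').create();

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
        return;
    }
    // page.evaluate runs inside the page and returns a serializable copy
    var stores = page.evaluate(function () {
        return window.stores;
    });
    console.log(JSON.stringify(stores[0], null, 2));
    phantom.exit();
});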
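Once the array has been read, you will probably want only a few fields per store. Here is a minimal sketch of that step, not part of the script above. The helper name pickFields is hypothetical, and the field names (fullName, city, state) are assumed from the sample output in the other answer, so adjust them to whatever the page really exposes.

```javascript
// Hypothetical helper: keep only the listed fields of each store object.
function pickFields(stores, fields) {
  return stores.map(function (store) {
    var picked = {};
    fields.forEach(function (name) {
      // Skip fields that a given store does not define
      if (Object.prototype.hasOwnProperty.call(store, name)) {
        picked[name] = store[name];
      }
    });
    return picked;
  });
}

// Sample element; its shape is an assumption based on the output in the other answer.
var sampleStores = [
  {
    fullName: 'Ogdensburg Walmart Store #2092',
    city: 'Ogdensburg',
    state: 'NY',
    storeId: 2092,
    openingSoon: false
  }
];

var summary = pickFields(sampleStores, ['fullName', 'city', 'state']);
console.log(JSON.stringify(summary));
```

In the PhantomJS script you would pass the array returned by page.evaluate instead of sampleStores.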
Answer 1 (score: 2)
I think you will have to use regular expressions, although this approach is not perfect.
$contents = file_get_contents('http://www.walmart.com/storeLocator/ca_storefinder_results.do?serviceName=&rx_title=com.wm.www.apps.storelocator.page.serviceLink.title.default&rx_dest=%2Findex.gsp&sfsearch_single_line_address=K6T');
// Find the index of every "stores[N] = {" assignment in the page source
preg_match_all('/stores\[(\d+)\] = \{/', $contents, $indexMatches);
foreach ($indexMatches[1] as $index) {
    // Grab the body of this store's object literal
    if (!preg_match('/stores\[' . $index . '\] = \{(.*?)\};/s', $contents, $block)) {
        continue;
    }
    // Split the body into 'key' : value pairs
    preg_match_all('/\'([a-zA-Z0-9]+)\' : ([^,]*?),/s', $block[1], $pairs);
    $results = array(); // reset so fields do not leak from one store to the next
    foreach ($pairs[1] as $i => $key) {
        $results[$key] = trim($pairs[2][$i], "'");
    }
    print_r($results);
}
This prints:
Array
(
[fullName] => Ogdensburg Walmart Store #2092
[street1] => 3000 Ford Street Ext
[city] => Ogdensburg
[state] => NY
[zipcode] => 13669
[phone] => (315) 394-8990
[latitude] => 44.7083
[longitude] => -75.4564
[storeName] => Walmart
[storeTypeId] => 2
[storeId] => 2092
[distance] => 22.01 miles
[directionsLink] => directionsLink
[directionsAvailable] => directionsAvailable
[directionsMessage] => directionsMessage
[hasOpen24HoursService] => false
[open24hrsMessage] => open24hrsMessage
[hoursWeekDays] => hoursWeekDays
[hoursSaturday] => hoursSaturday
[hoursSunday] => hoursSunday
[weekDays] => storeWeekDays
[weekEndSaturday] => storeSaturday
[weekEndSunday] => storeSunday
[storeInfoDays] => storeInfoDays
[storeInfoHours] => storeInfoHours
[moreDetailsLink] => moreDetailsLink
[openingSoon] => false
[recentlyOpen] => false
[siteToStoreAvailable] => true
[hasStoreEvent] => true
[eventLink] => http://localad.walmart.com/walmart/new_user_entry.aspx?storeref=2092&forceview=y
)
If you want to keep the single quotes (') around the values, remove the trim() call.
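For comparison, here is the same regex idea sketched in plain JavaScript, in case you already have the page source as a string. This is only a sketch. The function name parseStores is hypothetical, and the sample text assumes the page assigns each store as stores[N] = { 'key' : value, ... }; which is the layout the PHP patterns above imply. Unlike the PHP version, the pair pattern also accepts a last property with no trailing comma. Values that contain a comma are still cut short, and every value comes back as a string.

```javascript
// Hypothetical helper: extract the stores[N] = { ... }; assignments from page source.
function parseStores(source) {
  var stores = [];
  var blockRe = /stores\[(\d+)\] = \{([\s\S]*?)\};/g;
  var block;
  while ((block = blockRe.exec(source)) !== null) {
    var store = {};
    // 'key' : value, where the value ends at the next comma or at the end of the body
    var pairRe = /'([a-zA-Z0-9]+)' : ([^,]*?)\s*(?:,|$)/g;
    var pair;
    while ((pair = pairRe.exec(block[2])) !== null) {
      // Strip surrounding single quotes, like trim() in the PHP version
      store[pair[1]] = pair[2].replace(/^'+|'+$/g, '');
    }
    stores[Number(block[1])] = store;
  }
  return stores;
}

// Sample text; this layout is an assumption, not copied from the real page.
var sample = [
  "var stores = new Array();",
  "stores[0] = {",
  "  'fullName' : 'Ogdensburg Walmart Store #2092',",
  "  'city' : 'Ogdensburg',",
  "  'storeId' : 2092,",
  "  'openingSoon' : false",
  "};"
].join('\n');

var parsed = parseStores(sample);
console.log(JSON.stringify(parsed[0], null, 2));
```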