How to scrape data from a web page via a <script> tag in the HTML source (PHP)

Time: 2012-06-02 20:03:11

Tags: php web-scraping screen-scraping domdocument

I want to retrieve some data items from a web page.

Page link:

http://www.walmart.com/storeLocator/ca_storefinder_results.do

The data items I want to retrieve:

  • $StorePhone
  • $PharmacyPhone
  • $OpenHours
  • $PharmacyHours

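I have tried a lot but could not do it, because the tags have no id or specific class assigned, and there is no tag hierarchy that would let me get the data under each heading.

If you look at the HTML source of the page above, the data items are already available as variables inside a <script> tag. Can anyone tell me how to retrieve these data items for each store?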

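2 answers:

Answer 0: (score: 4)

Classic scraping will struggle with a page like this. You need to fetch the contents of the URL and then use a regular expression to extract the elements of the stores array.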
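One way to narrow the search before applying a regular expression is to pull out only the inline <script> text with DOMDocument. The following is a minimal sketch rather than a definitive implementation: the sample HTML, the helper name findInlineScripts, and the pattern are assumptions made for illustration, and the markup of the real page may differ. It also assumes the PHP DOM extension is available.

```php
<?php
// Hypothetical sample page; the real markup may differ.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
var stores = new Array();
stores[0] = {
  'storeId' : 2092,
  'phone' : '(315) 394-8990',
};
</script>
</head><body><p>Results</p></body></html>
HTML;

/**
 * Return the text of every <script> element that contains $needle.
 */
function findInlineScripts(string $html, string $needle): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('script') as $script) {
        if (strpos($script->textContent, $needle) !== false) {
            $found[] = $script->textContent;
        }
    }
    return $found;
}

$scripts = findInlineScripts($html, 'stores[');

// Capture the index and the body of each "stores[n] = { ... };" assignment.
preg_match_all('/stores\[(\d+)\]\s*=\s*\{(.*?)\};/s', $scripts[0], $blocks);
echo count($blocks[0]), " store block(s) found\n";
```

Restricting the regular expression to the script text keeps it from matching stray occurrences of "stores[" elsewhere in the page, at the cost of one extra parsing step.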

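If you have time to learn a new technology, I suggest you take a look at PhantomJS.

This headless browser behaves like a regular browser and executes JavaScript.

If you open the browser console on that page, you will see that the stores array is a global variable, so PhantomJS can read its contents easily.

Example PhantomJS script: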

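var url = "your url";
var page = require('webpage').create();
page.open(url, function (status) {
   if (status !== 'success') {
       console.log('Failed to load ' + url);
       phantom.exit(1);
       return;
   }
   // This function runs inside the page, where stores is a global variable.
   var stores = page.evaluate(function () {
       return window.stores;
   });
   console.log(JSON.stringify(stores[0], null, 2));
   phantom.exit(); // without this call the script never terminates
});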

Answer 1: (score: 2)

I think you have to use regular expressions, although the approach is not perfect.

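// Fetch the results page (the query string selects the search area).
$contents = file_get_contents('http://www.walmart.com/storeLocator/ca_storefinder_results.do?serviceName=&rx_title=com.wm.www.apps.storelocator.page.serviceLink.title.default&rx_dest=%2Findex.gsp&sfsearch_single_line_address=K6T');

// Find the index of every "stores[n] = {" assignment in the inline script.
preg_match_all('/stores\[(\d+)\] = \{/', $contents, $indexes);

foreach ($indexes[1] as $index) {
    // Capture the body of this store's object literal, up to the closing "};".
    preg_match('/stores\[' . $index . '\] = \{(.*?)\};/s', $contents, $block);

    // Split the body into 'key' : value pairs. A value ends at the next comma,
    // so a value that itself contains a comma is cut short.
    preg_match_all('/\'([a-zA-Z0-9]+)\' : ([^,]*?),/s', $block[1], $pairs);

    $results = []; // start fresh for each store
    foreach ($pairs[1] as $i => $key) {
        // Strip the surrounding single quotes from string values.
        $results[$key] = trim($pairs[2][$i], "'");
    }
    print_r($results);
}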

Output:

Array
(
    [fullName] => Ogdensburg Walmart Store #2092
    [street1] => 3000 Ford Street Ext
    [city] => Ogdensburg
    [state] => NY
    [zipcode] => 13669
    [phone] => (315) 394-8990
    [latitude] => 44.7083
    [longitude] => -75.4564
    [storeName] => Walmart
    [storeTypeId] => 2
    [storeId] => 2092
    [distance] => 22.01 miles
    [directionsLink] => directionsLink
    [directionsAvailable] => directionsAvailable
    [directionsMessage] => directionsMessage
    [hasOpen24HoursService] => false
    [open24hrsMessage] => open24hrsMessage
    [hoursWeekDays] => hoursWeekDays
    [hoursSaturday] => hoursSaturday
    [hoursSunday] => hoursSunday
    [weekDays] => storeWeekDays
    [weekEndSaturday] => storeSaturday
    [weekEndSunday] => storeSunday
    [storeInfoDays] => storeInfoDays
    [storeInfoHours] => storeInfoHours
    [moreDetailsLink] => moreDetailsLink
    [openingSoon] => false
    [recentlyOpen] => false
    [siteToStoreAvailable] => true
    [hasStoreEvent] => true
    [eventLink] => http://localad.walmart.com/walmart/new_user_entry.aspx?storeref=2092&forceview=y
)

If you want to keep the single quotes ('), remove the trim() call.
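Because the pattern above ends each value at the next comma, a quoted value that contains a comma would be truncated. The sketch below shows one possible variation that treats a single-quoted string as a unit and makes the quote stripping optional. The function name parseStoreBody, the $stripQuotes parameter, and the sample data (including the "Suite 1" suffix) are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical object-literal body, shaped like the stores[n] blocks above.
$body = <<<'JS'
  'fullName' : 'Ogdensburg Walmart Store #2092',
  'street1' : '3000 Ford Street Ext, Suite 1',
  'storeId' : 2092,
  'openingSoon' : false
JS;

/**
 * Parse 'key' : value pairs from a JavaScript object-literal body.
 * A value is either a single-quoted string (which may contain commas)
 * or a bare token such as a number, a boolean, or a variable name.
 */
function parseStoreBody(string $body, bool $stripQuotes = true): array
{
    $pattern = "/'([A-Za-z0-9_]+)'\s*:\s*('(?:[^'\\\\]|\\\\.)*'|[^,\s]+)/";
    preg_match_all($pattern, $body, $pairs, PREG_SET_ORDER);

    $result = [];
    foreach ($pairs as $pair) {
        // $pair[1] is the key, $pair[2] the raw value as written in the script.
        $result[$pair[1]] = $stripQuotes ? trim($pair[2], "'") : $pair[2];
    }
    return $result;
}

$store = parseStoreBody($body);          // quotes stripped
$raw   = parseStoreBody($body, false);   // quotes kept
print_r($store);
```

All values come back as strings, so 2092 and false would still need casting if typed values are required.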