Question

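I'm trying to use PhantomJS to get a list of all image src URLs on a given web page. My understanding is that this should be very easy, but for whatever reason I can't get it to work. Here is the code I have so far: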

var page = require('webpage').create();
page.open('http://www.walmart.com');

page.onLoadFinished = function(){
    var images = page.evaluate(function(){
        return document.getElementsByTagName("img");
    });
    for(thing in a){
        console.log(thing.src);
    }
    phantom.exit();
}

I also tried this:

var a = page.evaluate(function(){
    returnStuff = new Array;
    for(stuff in document.images){
        returnStuff.push(stuff);
    }
    return returnStuff;
});

And this:

var page = require('webpage').create();
page.open('http://www.walmart.com', function(status){
    var images = page.evaluate(function() {
        return document.images;
    });
    for(image in images){
        console.log(image.src);
    }
    phantom.exit();
});

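I also tried iterating over the images inside the evaluate function and reading the .src property there. None of these attempts returns anything meaningful. If I return the length of document.images, it reports 54 images on the page, but trying to iterate over them yields nothing useful.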

I have also looked at the following questions but couldn't make use of the information they provide: "How to scrape javascript injected image src and alt with phantom.js" and "How to download images from a site with phantomjs".

Again, I only want the source URLs. I don't need the actual files themselves. Thanks for your help.

Update
I tried using

var a = page.evaluate(function(){
    returnStuff = new Array;
    for(stuff in document.images){
        returnStuff.push(stuff.getAttribute('src'));
    }
    return returnStuff;
});

It throws an error saying that stuff.getAttribute('src') is undefined. Any idea why that would be?

Answer 1

@MayorMonty was almost there. Indeed, you cannot return an HTMLCollection.

As the docs say:

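Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.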

Closures, functions, DOM nodes, etc. will not work!

So a working script looks like this:

var page = require('webpage').create();

page.onLoadFinished = function(){

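    // evaluate runs inside the page, so return plain strings rather than DOM nodes
    var urls = page.evaluate(function(){
        var image_urls = [];
        var images = document.getElementsByTagName("img");
        // declare q with var so it does not leak into the page's global scope
        for(var q = 0; q < images.length; q++){
            image_urls.push(images[q].src);
        }
        return image_urls;
    });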

    console.log(urls.length);
    console.log(urls[0]);

    phantom.exit();
}

page.open('http://www.walmart.com');
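
The JSON rule of thumb quoted above can be checked outside PhantomJS. The sketch below is plain JavaScript, assumed here to run under Node.js purely for convenience; `fakeImage` and `roundTrip` are hypothetical names, and `fakeImage` is only a stand-in for a DOM node, which will not serialize in exactly the same way. It illustrates that an array of URL strings survives a JSON round trip while function-valued members are dropped, which is roughly why returning strings from evaluate works where returning nodes does not.

```javascript
// Hypothetical stand-in for an <img> node: one data property plus a method.
var fakeImage = {
    src: 'http://example.com/a.png',
    getAttribute: function (name) { return this[name]; }
};

// Mimic the "must survive JSON serialization" rule of thumb.
function roundTrip(value) {
    return JSON.parse(JSON.stringify(value));
}

// An array of plain strings comes back unchanged.
var urls = roundTrip([fakeImage.src]);

// An object with a method loses the method on the way through.
var copy = roundTrip(fakeImage);

console.log(urls[0]);
console.log(typeof copy.getAttribute);
```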

Answer 2

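I'm not sure about a plain JavaScript approach, but I recently used jQuery to scrape images and other data. After injecting jQuery into the page, you can write the script in the following style: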

$('.someclassORselector').each(function(){
     data['src']=$(this).attr('src');
   });

Answer 3

document.images is not an array of nodes but an HTMLCollection, which is built on Object. You can see this if you for..in over it:

for (a in document.images) {
  console.log(a)
}

This prints:

0
1
2
3
length
item
namedItem
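
The output above also explains the error in the question's update: for..in hands back property names, not elements. A minimal sketch, assuming a hypothetical array-like object `fakeImages` in place of document.images so that it can run outside a browser:

```javascript
// Hypothetical array-like object standing in for document.images.
var fakeImages = {
    0: { src: 'a.png' },
    1: { src: 'b.png' },
    length: 2
};

var keyTypes = [];
var srcs = [];
for (var key in fakeImages) {
    keyTypes.push(typeof key); // every key is a string such as "0" or "length"
    srcs.push(key.src);        // strings have no src property, so this is undefined
}

console.log(keyTypes);
console.log(srcs);
```

Because each `key` is a string, `key.getAttribute` is undefined as well, so calling it fails.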

Now, there are a few ways to fix this:

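1. ES6 spread operator: this turns an iterable into an array, as in [...document.images]
2. A regular for loop, just as with an array. This takes advantage of the fact that the keys are numbered like array indices: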
```
for(var i = 0; i < document.images.length; i++) {
  document.images[i].src
}
```

There are probably more.

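Solution 1 lets you use array functions such as map or reduce on the result, but it is less widely supported (I don't know whether the JavaScript version currently in Phantom supports it).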
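
If spread syntax turns out not to be available in the engine running the page (an assumption worth checking for your PhantomJS version), the ES5 idiom of borrowing Array.prototype methods gives access to the same array functions. A minimal sketch, with a hypothetical array-like `fakeImages` standing in for document.images:

```javascript
// Hypothetical array-like object standing in for document.images.
var fakeImages = {
    0: { src: 'a.png' },
    1: { src: 'b.png' },
    length: 2
};

// Copy the array-like into a real array, after which map, reduce etc. are available.
var asArray = Array.prototype.slice.call(fakeImages);

// Or borrow map directly and collect the src values in one step.
var srcList = Array.prototype.map.call(fakeImages, function (img) {
    return img.src;
});

console.log(Array.isArray(asArray));
console.log(srcList);
```

Inside page.evaluate, the same map call on document.images should yield a plain array of strings, which is the kind of value evaluate can return.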

Answer 4

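I used the following code to capture all the images loaded on the page. The images the browser loads change dimensions depending on the viewport, and since I wanted the largest versions, I used a large viewport to get the images at their actual size.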

Get all images on a page using PhantomJS; download all image URLs on a page using PhantomJS

It doesn't matter if the images are not in img tags; the code below can still retrieve their URLs.

Even images referenced from stylesheet rules like these are retrieved:

            @media screen and (max-width:642px) {
                .masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
                    background-image: url(assets/images/bg_studentcc-750x879-sm.jpg);
                }
            }
            @media screen and (min-width:643px) {
                .masthead--M4.masthead--textshadow.masthead--gradient.color-reverse {
                    background-image: url(assets/images/bg_studentcc-1920x490.jpg);
                }
            }

        var page =  require('webpage').create();
        var url = "https://......";

        page.settings.clearMemoryCaches = true;
        page.clearMemoryCache();
        page.viewportSize = {width: 1280, height: 1024};

        page.open(url, function (status) { 

            if(status=='success'){      
                console.log('The entire page is loaded.............################');
            }
        });

        page.onResourceReceived = function(response) {      
            if(response.stage == "start"){
                // contentType may be missing on some responses, so fall back to an empty string
                var respType = response.contentType || "";

                if(respType.indexOf("image")==0){           
                    console.log('Content-Type : ' + response.contentType)
                    console.log('Status : ' + response.status)
                    console.log('Image Size in byte : ' + response.bodySize)
                    console.log('Image Url : ' + response.url)
                    console.log('\n');
                }       
            }
        };
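
To collect the URLs into a list instead of only logging them, the filtering logic can be pulled into a small function. This is a sketch under stated assumptions: `isImageStart` and `imageUrls` are hypothetical names, and the `responses` array holds hand-written objects shaped like the ones the callback receives, so the snippet runs in any JavaScript engine without PhantomJS. It also assumes contentType may be missing and guards against that.

```javascript
// Returns true when a response looks like the first chunk of an image resource.
function isImageStart(response) {
    var type = response.contentType;
    return response.stage === 'start' &&
        typeof type === 'string' &&
        type.indexOf('image') === 0;
}

// Hypothetical responses, hand-written for illustration only.
var responses = [
    { stage: 'start', contentType: 'image/png', url: 'http://example.com/a.png' },
    { stage: 'end', contentType: 'image/png', url: 'http://example.com/a.png' },
    { stage: 'start', contentType: 'text/html', url: 'http://example.com/' },
    { stage: 'start', contentType: null, url: 'http://example.com/x' }
];

// Keep each image URL once, in the order it was first seen.
var imageUrls = [];
responses.forEach(function (response) {
    if (isImageStart(response) && imageUrls.indexOf(response.url) === -1) {
        imageUrls.push(response.url);
    }
});

console.log(imageUrls);
```

In a PhantomJS script, the body of the forEach callback would go inside page.onResourceReceived, pushing into a list declared alongside page.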
