Question

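I am trying to scrape some values from the following website: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data

Specifically, I am trying to get the "Last" value from the table at the bottom of the page, the one with the class "data default borderless". The problem is that when I search for that class name, nothing comes up.

The code I am using is as follows: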

from bs4 import BeautifulSoup
import urllib2
url = "https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
result = soup.findAll(attrs={"class":"data default borderless"})
print result

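One thing I noticed is that when I build the soup for that URL, the anchor is stripped and I get the HTML for this URL instead: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556

My understanding is that an anchor just moves you around the page and that all of the HTML should be there regardless, so I am wondering whether this table somehow does not load unless you navigate to the "data" section of the page.

Does anyone know how to force the table to load before I build the soup? Is there something else I am doing wrong that prevents me from seeing the table?

Thanks in advance!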

Answer 1

The content is generated dynamically by the following JavaScript:

<script type="text/javascript">
        var app = {};
        app.isOption = false;
        app.urls = {
            'spec':'/productguide/ProductSpec.shtml?details=&specId=6747556',
            'data':'/productguide/ProductSpec.shtml?data=&specId=6747556',
            'confirm':'/reports/dealreports/getSampleConfirm.do?hubId=4080&productId=3418',
            'reports':'/productguide/ProductSpec.shtml?reports=&specId=6747556',
            'expiry':'/productguide/ProductSpec.shtml?expiryDates=&specId=6747556'
        };
        app.Router = Backbone.Router.extend({
            routes:{
                "spec":"spec",
                "data":"data",
                "confirm":"confirm",
                "reports":"reports",
                "expiry":"expiry"
            },
            initialize: function(){
                _.bindAll(this, "spec");
            },
            spec:function () {
                this.navigate("");
                this._loadPage('spec');
            },
            data:function () {
                this._loadPage('data');
            },
            confirm:function () {
                this._loadPage('confirm');
            },
            reports:function () {
                this._loadPage('reports');
            },
            expiry:function () {
                this._loadPage('expiry');
            },
            _loadPage:function (cssClass, cb) {
                $('#right').html('Loading..').load(this._makeUrlUnique(app.urls[cssClass]), cb);
                this._updateNav(cssClass);
            },
            _updateNav:function (cssClass) {
                // the left bar gets hidden on margin rates because the tables get smashed up too much
                // so ensure they're showing for the other links
                $('#left').show();
                $('#right').removeClass('wide');
                // update the subnav css so the arrow points to the right location
                $('#subnav ul li a.' + cssClass).siblings().removeClass('on').end().addClass('on');
            },
            _makeUrlUnique:function (urlString) {
                return urlString + '&_=' + new Date().getTime();
            }
        });

        // init and start the app
        $(function () {
            window.router = new app.Router();
            Backbone.history.start();
        });
    </script>

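You can do one of two things:

1. Work out the real path and variables used to fetch the data. See the 'data' entry in the script, '/productguide/ProductSpec.shtml?data=&specId=6747556': the variables are passed in the query string and the response contains the content.
2. Use the RSS feed they provide and build your own table.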
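
As a rough illustration of option 1, the sketch below rebuilds the 'data' URL from app.urls and adds the same kind of timestamp parameter as _makeUrlUnique. It is written for Python 3, unlike the question's code. The names build_data_url and fetch_data_tables are made up for this example, and it assumes the endpoint still answers a plain GET without extra headers or cookies, which I have not verified.

```python
import time
from urllib.parse import urljoin
from urllib.request import urlopen

PAGE_URL = "https://www.theice.com/productguide/ProductSpec.shtml"


def build_data_url(spec_id, cache_bust=True):
    """Return the URL that the page's 'data' route loads into #right."""
    # Same path and query string as app.urls['data'] in the script above.
    path = "/productguide/ProductSpec.shtml?data=&specId={}".format(spec_id)
    url = urljoin(PAGE_URL, path)
    if cache_bust:
        # Same idea as _makeUrlUnique: append the current time in milliseconds.
        url += "&_={}".format(int(time.time() * 1000))
    return url


def fetch_data_tables(spec_id):
    """Download the fragment and return the matching tables (needs bs4)."""
    # Imported here so that build_data_url works even without bs4 installed.
    from bs4 import BeautifulSoup

    with urlopen(build_data_url(spec_id)) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    # Matches tags whose class attribute is exactly this string,
    # as in the question's original search.
    return soup.find_all(attrs={"class": "data default borderless"})
```

Calling fetch_data_tables(6747556) would then stand in for the original urlopen/BeautifulSoup lines, provided the server still serves the fragment at that path.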

Answer 2

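The table is generated by JavaScript, so you cannot get it without actually loading the page in a browser.

Alternatively, you can use Selenium to load the page and have it evaluate the JavaScript and HTML. Selenium opens a visible browser window, but you can use PhantomJS to run the browser headless.

Either way, you need the actual JS to run in a browser to generate the HTML.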
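
A minimal sketch of the browser-driven approach, under some assumptions: recent Selenium releases no longer include a PhantomJS driver, so this uses headless Chrome instead, and it assumes Chrome and a matching driver are available on the machine. The function name fetch_rendered_html and the selector in TABLE_SELECTOR are illustrative, the latter derived from the class name in the question.

```python
# CSS selector for an element carrying all three classes from the question.
TABLE_SELECTOR = ".data.default.borderless"


def fetch_rendered_html(url, timeout=10):
    """Load url in headless Chrome and return the HTML after the table appears."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the JavaScript has inserted the table into the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, TABLE_SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can be passed to BeautifulSoup just like page.read() in the question. Here the "#data" part of the URL matters, because the browser's router reads it to decide which section to load.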

Take a look at this answer.

Good luck!

Answer 3

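The HTML is generated with JavaScript, so BeautifulSoup will not be able to get the HTML for that table (in fact the whole <div id="right" class="main"> is loaded with JavaScript; I guess they are using node.js).

You can check this by printing the value of soup.get_text(). You will see that the table is not in the source.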
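
To make that check concrete, here is a small sketch that looks for the table's classes in two pieces of HTML. It uses only the standard library's html.parser so it runs without extra packages. Both HTML strings are invented stand-ins, not the site's real markup: one for the page as the server first sends it, one for the page after the script has filled in #right.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record every tag that carries all of the wanted CSS classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted.split())
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.matches.append(tag)


def find_tags_with_classes(html, wanted):
    """Return the names of tags in html that have every class in wanted."""
    collector = ClassCollector(wanted)
    collector.feed(html)
    return collector.matches


# Stand-in for what urlopen receives: #right is still empty.
STATIC_HTML = '<div id="left"></div><div id="right" class="main"></div>'
# Stand-in for the DOM after the script has loaded the data section.
LOADED_HTML = (
    '<div id="right" class="main">'
    '<table class="data default borderless"><tr><td>Last</td></tr></table>'
    "</div>"
)

print(find_tags_with_classes(STATIC_HTML, "data default borderless"))
print(find_tags_with_classes(LOADED_HTML, "data default borderless"))
```

The first call finds nothing and the second finds the table, which mirrors why the search in the question comes back empty.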

In this case, you cannot access the data unless you do what the page's JavaScript does to fetch the data from the server.

The HTML content I am trying to scrape only seems to load when I navigate to a certain anchor on the site
