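Unable to fetch some content from a webpage using a POST request

Date: 2019-07-04 20:28:40

Tags: python python-3.x web-scraping

I've written a Python script using Selenium to scrape some content that sits inside a container-like box in the left sidebar of a webpage. With Selenium I can get it without any trouble. Now I'd like to fetch the same content using the requests module. I experimented in the dev tools and noticed that a POST request is sent, which produces the JSON response pasted below. However, at this point I'm stuck on how to get the content using requests.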

webpage link

Selenium approach:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_content(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#tab-outline"))).click()
    for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#pageoutline > [class^='outline_H']"))):
        print(item.text)

if __name__ == '__main__':
    url = "http://wave.webaim.org/report#/www.onewerx.com"
    with webdriver.Chrome() as driver:
        wait = WebDriverWait(driver,10)
        get_content(url)

Partial output produced by the script (as desired):

Marketing Mix Modeling
Programmatic & Modeling
Programmatic is buying digital advertising space automatically, with computers using data to decide which ads to buy and how much to pay for them.
Modern
Efficient
Scalable
Resultative
What is Modeling?
Modeling is an analytical approach that uses historic information, such as syndicated point-of-sale data and companies’ internal data, to quantify the sales impact of various marketing activities.
Programmatic - future of the marketing

My attempt with requests:

import requests

url = "http://wave.webaim.org/data/request.php"

headers = {
    'Referer': 'http://wave.webaim.org/report',
    'X-Requested-With': 'XMLHttpRequest'
}

res = requests.post(url,data={'source':'http://www.onewerx.com'},headers=headers)
print(res.json())

I get the following output:

{'success': True, 'reportkey': '6520439253ac21885007b52c677b8078', 'contenttype': 'text/html; charset=UTF-8'}

How can I get the same content using requests?

To be clearer: This is what I'm interested in

The output above looks different from the image because the Selenium script clicks the button attached to that box to expand the content.

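1 answer:

Answer 0 (score: 1)

Well, I did some reverse engineering.
It seems the whole process runs on the client side. Here's how:

wave.engine.statistics contains the results you are looking for: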

// wave.min.js

wave.fn.applyRules = function() {
    var e = {};
    e.statistics = {};
    try {
        e.categories = wave.engine.run(),
        e.statistics = wave.engine.statistics;
        wave.engine.ruleTimes;
        e.statistics.pagetitle = wave.page.title,
        e.statistics.totalelements = wave.allTags.length,
        e.success = !0
    } catch (t) {
        console.log(t)
    }
    return e
}

Here the wave.engine.run function runs all the rules on the client side (s is the <body> element) and returns the results:

wave.engine.run = function(e) {
    var t = new Date
      , n = null
      , i = null
      , a = new Date;
    wave.engine.fn.calculateContrast(this.fn.getBody());
    var o = new Date
      , r = wave.rules
      , s = $(wave.page);
    if (e)
        r[e] && r[e](s);
    else
        for (e in r) {
            n = new Date;
            try {
                r[e](s)
            } catch (l) {
                console.log("RULE FAILURE(" + e + "): " + l.stack)
            }
            i = new Date,
            this.ruleTimes[e] = i - n,
            config.debug && console.log("RULE: " + e + " (" + this.ruleTimes[e] + "ms)")
        }
    return EndTimer = new Date,
    config.debug && console.log("TOTAL RULE TIME: " + (EndTimer - t) + "ms"),
    a = new Date,
    wave.engine.fn.structureOutput(),
    o = new Date,
    wave.engine.results
}

So you have two options: port these rules to Python, or stick with Selenium.

wave.rules = {},
wave.rules.text_justified = function(e) {
    e.find("p, div, td").each(function(t, n) {
        var i = e.find(n);
        "justify" == i.css("text-align") && wave.engine.fn.addIcon(n, "text_justified")
    })
}
,
wave.rules.alt_missing = function(e) {
    wave.engine.fn.overrideby("alt_missing", ["alt_link_missing", "alt_map_missing", "alt_spacer_missing"]),
    e.find("img:not([alt])").each(function(e, t) {
        var n = $(t);
        void 0 != n.attr("title") && 0 != n.attr("title").length || wave.engine.fn.addIcon(t, "alt_missing")
    })
}
// ... and many more
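
To give a feel for what porting a rule would involve, here is a minimal sketch of the alt_missing rule in Python using only the standard library's html.parser. The class and function names (AltMissingRule, find_alt_missing) are made up for illustration and are not part of WAVE. It skips the overrideby step and only sees static HTML, so anything that depends on computed CSS or script-generated markup (text_justified, for example) would not translate this easily.

```python
from html.parser import HTMLParser


class AltMissingRule(HTMLParser):
    """Rough Python equivalent of wave.rules.alt_missing (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.flagged = []  # src values of images that would get the icon

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Mirrors the jQuery selector img:not([alt]): any alt attribute,
        # even an empty one, excludes the image from this rule.
        if "alt" in attr_map:
            return
        # The JS rule skips images that have a non-empty title attribute.
        title = attr_map.get("title") or ""
        if not title:
            self.flagged.append(attr_map.get("src"))


def find_alt_missing(html):
    """Return the src of every <img> that has neither alt nor a non-empty title."""
    parser = AltMissingRule()
    parser.feed(html)
    parser.close()
    return parser.flagged


sample = (
    '<p><img src="a.png">'
    '<img src="b.png" alt="">'
    '<img src="c.png" title="Logo"></p>'
)
print(find_alt_missing(sample))
```

Only a.png is reported here: b.png has an (empty) alt attribute and c.png has a title.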

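Since the tests rely on the browser engine to fully render the page (unfortunately, the report is not generated in the cloud), you have to use Selenium for this job.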
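
If you stay with Selenium, one thing that might be worth trying is reading wave.engine.statistics straight from the page instead of scraping the sidebar. The sketch below is untested against the live site and rests on assumptions: that the wave object is reachable from the top-level window once the report has finished loading, and that the statistics object survives JSON.stringify. The helper name read_wave_statistics is made up for illustration.

```python
import json


def read_wave_statistics(driver, script="return JSON.stringify(wave.engine.statistics);"):
    """Ask the browser for wave.engine.statistics and decode it.

    `driver` is any object with Selenium's execute_script(script) method,
    already pointed at the loaded report page.
    Assumption: `wave` is a global in the current window context.
    """
    raw = driver.execute_script(script)
    # JSON.stringify(undefined) comes back as None through Selenium.
    return json.loads(raw) if raw else {}


# Hypothetical usage with the question's own setup:
#
#     with webdriver.Chrome() as driver:
#         driver.get("http://wave.webaim.org/report#/www.onewerx.com")
#         # wait for the report to finish, e.g. for "#tab-outline" to be present
#         print(read_wave_statistics(driver))
```

Serializing inside the browser keeps the transfer to a plain string, which avoids surprises when Selenium converts nested JavaScript objects on its own.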