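Extracting content with Python's regular expression module

Date: 2016-12-28 11:36:52

Tags: python regex parsing beautifulsoup

I am trying to get the contents of the variable 'html' from a JavaScript response. I am using the regular expression module to extract the HTML, but my output is None.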

response = 'var port_statistics = (function(window, undefined) {

function loadScript(url, callback) {
    var script = document.createElement('script');
    script.async = true;
    script.src = url;
    var entry = document.getElementsByTagName('script')[0];
    entry.parentNode.insertBefore(script, entry);
    script.onload = script.onreadystatechange = function() {
        var rdyState = script.readyState;
        if (!rdyState || /complete|loaded/.test(script.readyState)) {
            callback();
            script.onload = null;
            script.onreadystatechange = null;
        }
    };
}

function injectCss(css) {
    var style = document.createElement('style');
    style.type = 'text/css';
    css = css.replace(/\}/g, "}\n");
    if (style.styleSheet) {
        style.styleSheet.cssText = css;
    } else {
        style.appendChild(document.createTextNode(css));
    }
    var entry = document.getElementsByTagName('script')[0];
    entry.parentNode.insertBefore(style, entry);
}

var port_statistics = {};
var html = ["<div class=\"results_section\">", ", "
<div class='\"heading\"'> Overview </div> ",

 #HERE THE CONTENT I AM TRYING TO GET

 , "", "</div>", "", "", "</div>"].join('\n');

var div = document.createElement('div');
div.innerHTML = html;
var appendTo = document.getElementById('tag-port_statistics-widget');

appendTo.parentNode.insertBefore(div, appendTo);

loadScript('https://connect.url.com//jquery-1.11.1.min.js', function() {

    portWidget.$(function () {
        portWidget.$('tr.parent')
            .click(function () {
                portWidget.$(this).siblings('.child-' + this.id).fadeToggle('slow');
                portWidget.$(this).find('.plus').toggle();
                portWidget.$(this).find('.minus').toggle();
            });
    });
});

return port_statistics;

})(window);'

prog=re.search("var html = [.*?].join('\n');", response)
print(prog) #Output: None

I also tried this:

soup = BeautifulSoup(response, 'html.parser')
print(soup.prettify())
div_search = re.search('["<div class=\"results_section\">",(.*), "</div>"]', soup.prettify(), re.IGNORECASE)
print(div_search.group(0)) #Output: v

How can I get the contents of the variable 'html'? As a second step, I want to use that content to parse the HTML tags with BeautifulSoup.

Thanks.

Edit

I want to get this:

  "<div class=\"results_section\">", ", "
<div class='\"heading\"'> Overview </div> ",

 #HERE THE CONTENT I AM TRYING TO GET

 , "", "</div>", "", "", "</div>"

1 answer:

Answer 0 (score: 1)

import re
# Brackets are escaped so they match literally; re.DOTALL lets '.' cross newlines.
result = re.search(r'var html = \[(.+?)\]', response, re.DOTALL)
print(result.group(1))
  


     

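From the Python re documentation: "(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline."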
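To see the difference in isolation, here is a minimal sketch that runs the same pattern with and without the flag. The two-line `sample` string is a made-up stand-in for the real response, not the asker's data.

```python
import re

# Hypothetical two-line sample standing in for the real response.
sample = 'var html = ["<div>",\n"</div>"];'

pattern = r'var html = \[(.+?)\]'

# Without DOTALL, '.' stops at the newline, so the closing bracket is never reached.
no_flag = re.search(pattern, sample)

# With DOTALL, '.' also matches '\n', so the group can span both lines.
with_flag = re.search(pattern, sample, re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```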

Your text contains newlines, so you need the DOTALL flag for the dot to match across them.
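For the second step in the question (turning the matched array into an HTML string), one possible approach is sketched below. It rests on several assumptions that are not in the original post: the `response` here is a shortened, well-formed stand-in, `extract_html` is a hypothetical helper name, and the array is assumed to contain only double-quoted string literals so that it can be read as JSON. The pattern also anchors on `.join(`, a variation on the answer's pattern, so that a `]` inside the HTML does not end the match early.

```python
import json
import re

# Hypothetical, well-formed stand-in for the JavaScript response.
response = r'''var port_statistics = {};
var html = ["<div class=\"results_section\">",
"<div class=\"heading\"> Overview </div>",
"<table><tr><td>[1]</td></tr></table>",
"</div>"].join('\n');
var div = document.createElement('div');'''


def extract_html(js_source):
    """Return the joined HTML from the `var html = [...]` array, or None."""
    # Escape the brackets (an unescaped [...] is a character class) and
    # require `.join(` after the closing bracket.
    match = re.search(r'var html = (\[.*?\])\.join\(', js_source, re.DOTALL)
    if match is None:
        return None
    # Assumption: the array holds only double-quoted strings, i.e. valid JSON.
    parts = json.loads(match.group(1))
    return '\n'.join(parts)


html = extract_html(response)
print(html)
```

The resulting string can then be handed to BeautifulSoup(html, 'html.parser'), as in the question's second attempt. If the real array mixes in single-quoted strings or other JavaScript syntax, json.loads will raise an error and a different decoding step would be needed.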