BeautifulSoup4 get_text still has JavaScript

Date: 2014-04-02 01:39:34

Tags: python beautifulsoup nltk

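I'm trying to remove all the HTML and JavaScript with bs4, but it doesn't get rid of the JavaScript. I still see it in the extracted text. How can I get around this?

I tried nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?

I tried looking at these other pages: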

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.get_text())

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

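The problem with html2text is that it is really, really slow at times and creates noticeable lag, which is one thing nltk was always very good with.

2 Answers:

Answer 0: (score: 73)

Based partly on "Can I remove script tags with BeautifulSoup?"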

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
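
To try the same idea without a network request, here is a minimal sketch that wraps the steps above in a function and runs it on an inline page. The function name visible_text and the sample markup are made up for illustration, and "html.parser" is just one parser choice; only the BeautifulSoup calls themselves come from the answer.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text with <script>/<style> content removed (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements, including everything inside them.
    for element in soup(["script", "style"]):
        element.decompose()

    text = soup.get_text()

    # Strip each line, split on double spaces, and drop empty pieces.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample page, not taken from CNN.
sample_html = """
<html>
<head>
<title>Demo</title>
<style>p { color: red; }</style>
<script>var tracker = "js";</script>
</head>
<body>
<h1>Hello</h1>
<p>World  again</p>
</body>
</html>
"""

result = visible_text(sample_html)
print(result)
```

Calling soup(["script", "style"]) is shorthand for soup.find_all(["script", "style"]), so the loop visits every matching element in the document before get_text() runs.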

Answer 1: (score: 8)

To prevent encoding errors at the end...

import sys
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # same page as in the question
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# write UTF-8 bytes directly so the console encoding cannot raise UnicodeEncodeError
sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
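
As a rough illustration of the kind of encoding error this answer guards against, the sketch below checks whether a string can be encoded with a given codec. The helper name can_encode and the sample headline are assumptions made for this example, not part of the answer.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (illustrative helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical headline containing non-ASCII characters.
headline = "Café headlines: 東京"

print(can_encode(headline, "ascii"))   # a narrow codec cannot represent every character
print(can_encode(headline, "utf-8"))   # UTF-8 can represent any Unicode text

# Encoding to UTF-8 produces bytes that decode back to the same string.
encoded = headline.encode("utf-8")
round_trip = encoded.decode("utf-8")
```

Scraped pages often contain characters outside the output stream's encoding, which is why explicitly encoding to UTF-8 before output sidesteps the error.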