I am using soup.get_text(), but the output it returns is metadata (embedded script code) instead of the page's readable text.
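import requests
from bs4 import BeautifulSoup

# requests needs an explicit scheme, so the short URL is prefixed with https://
url = "https://bit.ly/2DrYyhH"
r = requests.get(url)
print(type(r))
html = r.text
soup = BeautifulSoup(html, "lxml")
print(type(soup))
print(soup.title)
# get_text() collects every text node in the document
text = soup.get_text()
print(text)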
Output:
if ( typeof pmc !== 'undefined' && typeof pmc.hooks !== 'undefined' ) {
    pmc.hooks.add_filter( 'pmc-adm-set-targeting-keywords', function( keywords ) {
        try {
            if ( typeof Krux !== 'undefined' ) {
                if ( typeof keywords['ksg'] === 'undefined' ) {
                    keywords['ksg'] = Krux.segments;
                }
Answer 0 (score: 0)
This gets all of the text on the page at that URL, except the contents of script, meta, link, and style tags and HTML comments:
import requests
from bs4 import BeautifulSoup, Comment

# requests needs an explicit scheme, so the short URL is prefixed with https://
URL = "https://bit.ly/2DrYyhH"
r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

# Walk every text node inside <body> and print only the visible ones
for text in soup.body.find_all(string=True):
    if (text.parent.name not in ['script', 'meta', 'link', 'style']
            and not isinstance(text, Comment)
            and text != '\n'):
        print(text.strip())
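
To try the same filtering without a network request, here is a minimal sketch that runs on an inline HTML string. The names visible_text, SKIPPED_TAGS, and SAMPLE_HTML are made up for illustration, and it assumes the built-in "html.parser" backend instead of lxml so that nothing beyond bs4 needs to be installed. It also drops whitespace-only strings, which is slightly stricter than the `text != '\n'` check above.

```python
from bs4 import BeautifulSoup, Comment

# Tags whose text content is not meant to be read by a visitor (hypothetical name).
SKIPPED_TAGS = {"script", "style", "meta", "link"}


def visible_text(html):
    """Return the non-empty, stripped text nodes outside SKIPPED_TAGS and comments."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.body or soup  # fall back to the whole document if there is no <body>
    lines = []
    for node in root.find_all(string=True):
        if node.parent.name in SKIPPED_TAGS:
            continue  # JavaScript, CSS, and similar non-visible content
        if isinstance(node, Comment):
            continue  # HTML comments are text nodes too
        stripped = node.strip()
        if stripped:  # skip whitespace-only strings
            lines.append(stripped)
    return lines


# Small stand-in page (an assumption, not the page from the question).
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
<h1>Headline</h1>
<script>var keywords = {};</script>
<!-- a comment -->
<p>First <b>paragraph</b>.</p>
</body></html>
"""

if __name__ == "__main__":
    for line in visible_text(SAMPLE_HTML):
        print(line)
```

Returning a list instead of printing directly makes the result easy to join into one string, for example with "\n".join(visible_text(html)).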