Question

I am using soup.get_text(), but the output it gives me is metadata (inline script code) instead of the page's text.

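import requests
from bs4 import BeautifulSoup

# requests needs an explicit scheme (http:// or https://) on the URL
url = "https://bit.ly/2DrYyhH"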

r = requests.get(url)
print(type(r))

html = r.text

soup = BeautifulSoup(html, "lxml")
print(type(soup))

print(soup.title)
text = soup.get_text()

print(text)

Output:

if ( typeof pmc !== 'undefined' && typeof pmc.hooks !== 'undefined' ) {
    pmc.hooks.add_filter( 'pmc-adm-set-targeting-keywords', function( keywords ) {
        try {
            if ( typeof Krux !== 'undefined' ) {
                if ( typeof keywords['ksg'] === 'undefined' ) {
                    keywords['ksg']  = Krux.segments;
                }

Answer 1

This gets all the text from the page at that URL, except for the contents of script, meta, link, and style tags:

import requests
from bs4 import BeautifulSoup, Comment

# requests needs an explicit scheme (http:// or https://) on the URL
URL = "https://bit.ly/2DrYyhH"

r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

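for text in soup.body.find_all(string=True):
    # Skip strings that live inside non-visible tags, and skip HTML comments
    if text.parent.name in ('script', 'meta', 'link', 'style') or isinstance(text, Comment):
        continue
    stripped = text.strip()
    if stripped:  # skip whitespace-only strings so no blank lines are printed
        print(stripped)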
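
If you would rather keep calling get_text(), another option is to delete the unwanted tags from the tree first and extract the text afterwards. The following is only a minimal sketch of that idea, not the method used above: the function name visible_text and the SAMPLE_HTML string are made up for illustration, and it uses the built-in "html.parser" so it runs without lxml (passing "lxml" should work the same way if it is installed).

```python
from bs4 import BeautifulSoup, Comment


def visible_text(html, parser="html.parser"):
    """Return the readable text of an HTML document, one fragment per line.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, parser)

    # Remove tags whose contents are code or styling, not page text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Remove HTML comments; depending on the bs4 version,
    # get_text() might otherwise include them.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Use the body when present; fall back to the whole document otherwise.
    root = soup.body or soup
    return root.get_text(separator="\n", strip=True)


# Made-up sample page so the sketch runs without a network request.
SAMPLE_HTML = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <h1>Headline</h1>
    <script>if (typeof pmc !== 'undefined') { var x = 1; }</script>
    <!-- tracking comment -->
    <p>First <b>paragraph</b>.</p>
  </body>
</html>
"""

print(visible_text(SAMPLE_HTML))
```

For a live page you would pass r.text (or r.content) from requests.get(URL) instead of SAMPLE_HTML. The trade-off compared with the loop above is that decompose() modifies the soup object, so parse the HTML again if you still need the script tags later.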

Get text from a URL using BeautifulSoup
