Getting text from a URL with BeautifulSoup

Time: 2018-04-13 02:27:16

Tags: python-3.x beautifulsoup

I am using soup.get_text(), but the output it returns is metadata (JavaScript from the page) instead of the page's text.

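import requests
from bs4 import BeautifulSoup

# Scheme added (assumed https); requests rejects a URL without one
url = "https://bit.ly/2DrYyhH"

r = requests.get(url)
print(type(r))

html = r.text

soup = BeautifulSoup(html, "lxml")
print(type(soup))

print(soup.title)
text = soup.get_text()

print(text)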

Output:

if ( typeof pmc !== 'undefined' && typeof pmc.hooks !== 'undefined' ) {
    pmc.hooks.add_filter( 'pmc-adm-set-targeting-keywords', function( keywords ) {
        try {
            if ( typeof Krux !== 'undefined' ) {
                if ( typeof keywords['ksg'] === 'undefined' ) {
                    keywords['ksg']  = Krux.segments;
                }

1 answer:

Answer 0 (score: 0)

This gets all the text at the URL, excluding anything inside script, meta, link, and style tags:

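import requests
from bs4 import BeautifulSoup, Comment

# Scheme added (assumed https); requests rejects a URL without one
URL = "https://bit.ly/2DrYyhH"

r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")

# Tags whose contents are not visible page text
SKIPPED_TAGS = ['script', 'meta', 'link', 'style']

for text in soup.body.find_all(string=True):
    # Skip strings inside non-content tags, HTML comments, and bare newlines
    if text.parent.name not in SKIPPED_TAGS and not isinstance(text, Comment) and text != '\n':
        print(text.strip())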
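
A variation on the same idea, in case it is useful: instead of checking each string's parent, you can delete the unwanted tags from the tree first and then collect what is left. This is only a sketch under some assumptions. SAMPLE_HTML and visible_text are hypothetical names made up for the example, the inline HTML stands in for the real page so the code runs without a network request, and "html.parser" is used in place of "lxml" only to avoid the extra dependency. With the real page you would pass r.content in place of SAMPLE_HTML.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <style>body { color: black; }</style>
    <script>var tracking = "ignored";</script>
  </head>
  <body>
    <!-- a comment that should not appear -->
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <script>console.log("also ignored");</script>
    <p>Second <b>paragraph</b>.</p>
  </body>
</html>
"""


def visible_text(html):
    """Return the page text with script/style content and comments removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove non-content tags together with everything inside them.
    for tag in soup(["script", "style", "meta", "link"]):
        tag.decompose()

    # HTML comments are string nodes too, so drop them explicitly.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Fall back to the whole document if the page has no <body>.
    root = soup.body or soup

    # stripped_strings yields each remaining string with surrounding
    # whitespace removed and skips strings that are only whitespace.
    return "\n".join(root.stripped_strings)


print(visible_text(SAMPLE_HTML))
```

Compared with the loop above, this also drops whitespace-only strings instead of printing empty lines for them, because stripped_strings skips them.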