Text is missing when I try to scrape a website with BeautifulSoup

Time: 2020-09-15 16:16:12

Tags: python beautifulsoup

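I am trying to scrape the body text of a news article on the London Stock Exchange website, but the text does not appear when I try to pull it out with BeautifulSoup. Does anyone know how I can get this information?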

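I can find the tag when I click Inspect, but the text does not appear when I view the page source (Ctrl + U). I suspect the content is loaded into the page from somewhere else, but I'm not sure about that, and I don't know how to scrape it.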

The page I'm looking at is: https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452

I'm trying to get the main content about the interim results.

1 answer:

Answer 0 (score: 0)

The article is stored inside a <script> tag within the page, as escaped JSON. You can extract it with the following example:

import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.londonstockexchange.com/news-article/PFG/interim-results-for-six-months-ended-30-june-2020/14665452'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
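data = soup.select_one('#ng-lseg-state').string
# undo the escaping of the embedded JSON; '&a;' goes last so that an
# escaped ampersand cannot combine with following text into a new escape
for escaped, plain in (('&q;', '"'), ('&l;', '<'), ('&g;', '>'), ('&s;', "'"), ('&a;', '&')):
    data = data.replace(escaped, plain)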
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def find_news_article(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == 'newsArticle':
                yield v
            else:
                yield from find_news_article(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_news_article(v)

article = BeautifulSoup(next(find_news_article(data))['value'], 'html.parser')

# print text from article on screen:
print(article.get_text(strip=True, separator='\n'))

Prints:

RNS Number : 1348X
Provident Financial PLC
26 August 2020
Provident Financial plc
Interim results for the six months ended 30 June 2020
Provident Financial plc ('the Group') is the leading provider of credit products to consumers who are underserved by mainstream lenders. The Group serves c.2.2 million customers and its operations consist of Vanquis Bank, Moneybarn, and the Consumer Credit Division ('CCD') comprising Provident home credit and Satsuma.

...and so on.
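
If the page changes or you want to try the idea without hitting the site, the unescaping and the recursive key lookup can be checked offline. This is only a minimal sketch using the standard library: the `sample` string is hand-written for illustration (it is not real data from the site), and `unescape_state` and `find_key` are hypothetical helper names, not part of any library.

```python
import json

# Escape sequences handled in the unescaping step above. '&a;' is decoded
# last so an escaped ampersand cannot turn into a new escape sequence.
ESCAPES = [('&q;', '"'), ('&s;', "'"), ('&l;', '<'), ('&g;', '>'), ('&a;', '&')]


def unescape_state(raw):
    """Turn the escaped <script> contents back into plain JSON text."""
    for escaped, plain in ESCAPES:
        raw = raw.replace(escaped, plain)
    return raw


def find_key(data, key):
    """Yield every value stored under `key` at any depth of dicts and lists."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                yield v
            else:
                yield from find_key(v, key)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, key)


# Hand-written sample (an assumption about the general shape, NOT site data).
sample = ('{&q;page&q;:{&q;items&q;:[{&q;newsArticle&q;:'
          '{&q;value&q;:&q;&l;p&g;Profit &a;amp; loss&l;/p&g;&q;}}]}}')

state = json.loads(unescape_state(sample))
article_html = next(find_key(state, 'newsArticle'))['value']
print(article_html)  # prints: <p>Profit &amp; loss</p>
```

The resulting HTML fragment can then be passed to BeautifulSoup and `get_text()` exactly as in the answer's code.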