Web scraping doesn't find the correct tags

Asked: 2021-02-17 10:08:27

Tags: pandas beautifulsoup

I'm trying to extract the text of this page using bs4 and pandas: https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033

I started with:

import requests
from bs4 import BeautifulSoup

src = requests.get(url).content
soup = BeautifulSoup(src, 'xml')

and saw that the text I'm interested in is wrapped in <p> tags.

But when I run soup.find_all('p'), the only thing I get back is the closing paragraph.

How can I extract the paragraph text? What am I missing?

These are the paragraphs I'm trying to extract (shown in a screenshot in the original post).

I also tried using Selenium:

import os

from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'xml')
div = page_soup.find_all('p')
[a.text for a in div]

1 Answer:

Answer 0 (score: 1)

I figured it out.

The body of the site comes from a <script> tag that holds a JSON, but with a funky encoding.

That tag has an id of "ng-lseg-state", which means this is Angular's custom HTML encoding.

You can target the <script> tag with BeautifulSoup and parse it with the json module.

Then, you need to deal with Angular's encoding. One way, a bit rough around the edges, is to chain a bunch of .replace() methods.

Here's how:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
# Grab the Angular state <script> tag that carries the article JSON.
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
# The JSON uses &q; in place of double quotes, so restore them before parsing.
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
# Undo Angular's custom entity encoding so the body parses as regular HTML.
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p")[22].getText())

Output:

Essentra plc is a FTSE 250 company and a leading global provider of essential components and solutions.&a;#160; Organised into three global divisions, Essentra focuses on the light manufacture and distribution of high volume, enabling components which serve customers in a wide variety of end-markets and geographies. 
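As an aside, the long main_key string doesn't have to be typed from scratch: it's simply one of the keys of the parsed state object, which maps cached API endpoints to their responses. A minimal sketch with a mocked payload (the real dictionary is what json.loads() returns for the ng-lseg-state contents):

```python
# Hypothetical sketch: discover main_key from the parsed state object
# instead of hard-coding it. The payload below is a mock of the real dict.
payload = {
    "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article": {
        "body": {"components": []},
    },
}

# Pick the key for the news-article page out of the state object's keys.
main_key = next(k for k in payload if "news-article" in k)
print(main_key)
```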

However, as I've said, this is not the best approach, as I'm not entirely sure how to deal with a bunch of other characters, namely:

  • &a;#160;
  • &a;amp;
  • &s;

just to name a few. But I've already asked about this.

EDIT:

Here's the full code based on the answer to the question I mentioned above.

import html
import json

import requests
from bs4 import BeautifulSoup


def unescape(decoded_html):
    # Map Angular's custom entities back to the characters they encode.
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    # Resolve remaining standard HTML entities, e.g. &#160; and &amp;.
    return html.unescape(decoded_html)


url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p")[22].getText())
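As a quick sanity check, the unescape helper can be exercised on a small literal string, without fetching the page, to show how the custom entities and the leftover numeric entities such as &a;#160; are resolved:

```python
import html


def unescape(decoded_html):
    # Same helper as above: map Angular's custom entities first, then let
    # html.unescape resolve standard ones such as &#160; and &amp;.
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    return html.unescape(decoded_html)


# A made-up sample in the same encoding as the article body.
sample = '&l;p&g;Q&a;amp;A &q;FTSE 250&q;&l;/p&g;'
print(unescape(sample))  # <p>Q&A "FTSE 250"</p>
```

Note that '&a;' must be replaced before html.unescape runs, since '&a;amp;' and '&a;#160;' only become valid HTML entities ('&amp;', '&#160;') after that first pass.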