Web scraping doesn't find the correct tags

Asked: 2021-02-17 10:08:27

Tags: pandas beautifulsoup

I'm trying to extract the text of this page using bs4 and pandas: https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033

I started with:

import requests
from bs4 import BeautifulSoup

src = requests.get(url).content
soup = BeautifulSoup(src, 'xml')

and saw that the text I'm interested in is wrapped in <p> tags.

But when I run soup.find_all('p'), the only thing I get back is the closing paragraph.

How can I extract the paragraph text? What am I missing?

These are the paragraphs I'm trying to extract (shown in a screenshot in the original post).

I also tried using Selenium:

import os

from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'xml')
div = page_soup.find_all('p')
[a.text for a in div]

1 Answer:

Answer 0 (score: 1)

I figured it out.

The body of the site comes from a <script> tag that holds a JSON, but with a funky encoding.

That tag has an id of "ng-lseg-state", which means this is Angular's custom HTML encoding.

You can target the <script> tag with BeautifulSoup and parse it with the json module.

Then, you need to deal with Angular's encoding. One way, a bit rough around the edges, is to chain a bunch of .replace() methods.

Here's how:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
# Grab the Angular state <script> tag that carries the article JSON.
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
# The JSON uses &q; in place of double quotes, so restore them before parsing.
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
# Undo Angular's custom entity encoding so the body parses as regular HTML.
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p")[22].getText())

Output:

Essentra plc is a FTSE 250 company and a leading global provider of essential components and solutions.&a;#160; Organised into three global divisions, Essentra focuses on the light manufacture and distribution of high volume, enabling components which serve customers in a wide variety of end-markets and geographies. 
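As an aside, the long main_key string doesn't have to be typed from scratch: it's simply one of the keys of the parsed state object, which maps cached API endpoints to their responses. A minimal sketch with a mocked payload (the real dictionary is what json.loads() returns for the ng-lseg-state contents):

```python
# Hypothetical sketch: discover main_key from the parsed state object
# instead of hard-coding it. The payload below is a mock of the real dict.
payload = {
    "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article": {
        "body": {"components": []},
    },
}

# Pick the key for the news-article page out of the state object's keys.
main_key = next(k for k in payload if "news-article" in k)
print(main_key)
```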

However, as I've said, this is not the best approach, as I'm not entirely sure how to deal with a bunch of other characters, namely:

  • &a;#160;
  • &a;amp;
  • &s;

just to name a few. But I've already asked about this.

EDIT:

Here's the full code based on the answer to the question I mentioned above.

import html
import json

import requests
from bs4 import BeautifulSoup


def unescape(decoded_html):
    # Map Angular's custom entities back to the characters they encode.
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    # Resolve remaining standard HTML entities, e.g. &#160; and &amp;.
    return html.unescape(decoded_html)


url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p")[22].getText())
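As a quick sanity check, the unescape helper can be exercised on a small literal string, without fetching the page, to show how the custom entities and the leftover numeric entities such as &a;#160; are resolved:

```python
import html


def unescape(decoded_html):
    # Same helper as above: map Angular's custom entities first, then let
    # html.unescape resolve standard ones such as &#160; and &amp;.
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    return html.unescape(decoded_html)


# A made-up sample in the same encoding as the article body.
sample = '&l;p&g;Q&a;amp;A &q;FTSE 250&q;&l;/p&g;'
print(unescape(sample))  # <p>Q&A "FTSE 250"</p>
```

Note that '&a;' must be replaced before html.unescape runs, since '&a;amp;' and '&a;#160;' only become valid HTML entities ('&amp;', '&#160;') after that first pass.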