I am trying to extract the text of this page using bs4 and pandas: https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033
I started with:
import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
src = requests.get(url).content
soup = BeautifulSoup(src, 'xml')
I can see that the text I'm interested in is wrapped in <p> tags, but when I run soup.find_all('p'), the only thing returned is the closing paragraph. How do I extract the paragraph text? What am I missing?
These are the paragraphs I'm trying to extract:
I have also tried using Selenium:
import os
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_driver = os.getcwd() + "\\chromedriver.exe"
driver = webdriver.Chrome(options=chrome_options, executable_path=chrome_driver)
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'xml')
div = page_soup.find_all('p')
[a.text for a in div]
Answer 0 (score: 1)
I figured it out.
The body text of the site comes from a <script> tag that contains a JSON payload, but with some funky encoding.
The tag's id is "ng-lseg-state", which indicates Angular's custom HTML encoding.
You can target that <script> tag with BeautifulSoup and parse its contents with the json module.
Then you need to deal with Angular's encoding. One way, albeit a bit crude, is to chain a bunch of .replace() calls.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p")[22].getText())
Output:
Essentra plc is a FTSE 250 company and a leading global provider of essential components and solutions.&a;#160; Organised into three global divisions, Essentra focuses on the light manufacture and distribution of high volume, enabling components which serve customers in a wide variety of end-markets and geographies.
However, as I said, this isn't the best approach, since I'm not entirely sure how to handle a number of other characters, namely:
&a;#160;
&a;amp;
&s;
to name a few. But I've already asked about this.
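To see why a two-step decode helps with these: mapping &a; back to & first turns doubly-encoded sequences like &a;#160; and &a;amp; into ordinary HTML entities, which the standard library's html.unescape can then resolve. A minimal sketch on a made-up sample string (the text is illustrative, not taken from the page):

```python
import html

# Hypothetical sample mimicking the doubly-encoded entities listed above.
sample = "Components&a;#160;&a;amp; Solutions &q;Essentra&q;"

# Step 1: undo Angular's escaping, e.g. &a;#160; -> &#160;, &a;amp; -> &amp;
step1 = sample.replace("&a;", "&").replace("&q;", '"')

# Step 2: resolve the now-ordinary HTML entities
# (&#160; becomes a non-breaking space, &amp; becomes &).
step2 = html.unescape(step1)
print(step2)
```

The same idea is what the full solution below wraps into a reusable function.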
EDIT:
Here's the full code, based on the answer to the question mentioned above:
import html
import json
import requests
from bs4 import BeautifulSoup
def unescape(decoded_html):
    char_mapping = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
    }
    for key, value in char_mapping.items():
        decoded_html = decoded_html.replace(key, value)
    return html.unescape(decoded_html)
url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p")[22].getText())
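A side note: hard-coding find_all("p")[22] ties the script to the current article layout. It can be safer to collect every paragraph and filter afterwards, and that collection step needs nothing beyond the standard library. A sketch using html.parser, with a toy HTML string standing in for article_body:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text content of every <p> element."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

parser = ParagraphExtractor()
# Toy input; in the script above you would feed article_body instead.
parser.feed("<p>First paragraph.</p><div><p>Second.</p></div>")
print(parser.paragraphs)  # → ['First paragraph.', 'Second.']
```

From the resulting list you can pick paragraphs by content (e.g. the one mentioning "FTSE 250") rather than by a brittle index.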