Question

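I'm new to web scraping and its libraries. I want to extract some information from a website, but I can't find the text I'm looking for in what I scrape. This is the site: "https://webgate.ec.europa.eu/rasff-window/screen/notification/486901".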

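I want to extract information such as the product, product category, reference, and subject from the site and put it into a data frame.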

My approach is to use the requests and BeautifulSoup libraries to extract the text, as many articles suggest. This is the code I'm using:

from bs4 import BeautifulSoup
import requests

url = 'https://webgate.ec.europa.eu/rasff-window/screen/notification/486901'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

text = soup.find_all(text=True)

But when I inspect text to see what it contains, I get this:

text

['\n',
'doctype html',
'\n',
'\n',
'\n',
'\n',
'RASFF WINDOW',
'\n',
'\n',
'\n',
'\n',
' FOR OPEN ID IMPLEMENTATION\n\n    <script type="text/javascript" src="/rasff-window/assets/jsrsasign-all-min.js"></script>\n    <script type="text/javascript" src="/rasff-window/assets/oidc-client.min.js"></script>\n    <script type="text/javascript" src="/rasff-
.
.]

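When I use Inspect Element, I can see the information I want on the page, but it doesn't appear in my text. Also, as far as I can tell, the information isn't inside any dynamic JavaScript, yet somehow it isn't extracted as text.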

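What should I do in this case? I need to extract information from about 1000 similar pages, which is why I'd rather not use Selenium for the job. Any suggestions or reading material would be much appreciated.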

Answer 1

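The data is loaded from an external source (a backend API that the page calls after it loads), so it isn't in the HTML that requests downloads and BeautifulSoup never sees it. You can use the requests module to get the data straight from their API: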

import json
import requests
import pandas as pd


api_url = "https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/486901/"
data = requests.get(api_url).json()

#uncomment this to print all returned data:
# print(json.dumps(data, indent=4))

# print some data:
print(data["reference"])
print(data["subject"])

# or create a df
#df = pd.json_normalize(data)
#print(df)

Prints:

2021.3617
Salmonella in kipfilet
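Since the question mentions about 1000 similar pages, here is a minimal sketch of how the same API call could be repeated over a list of notification IDs. It assumes every notification follows the URL pattern shown above with only the numeric ID changing. The names `fetch_json` and `collect_rows` are made up for this illustration. To keep the sketch free of third-party dependencies it downloads with the standard library's `urllib.request` instead of requests. Only the `reference` and `subject` keys are confirmed by the output above; any other key (product, category, and so on) has to be looked up in the printed JSON first.

```python
import json
import urllib.request

# URL pattern taken from the answer above; only the numeric ID changes (assumption).
API_TEMPLATE = (
    "https://webgate.ec.europa.eu/rasff-window/backend/public/"
    "notification/view/id/{}/"
)


def fetch_json(url, timeout=30):
    """Download one URL and decode its JSON body (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def collect_rows(notification_ids, fields=("reference", "subject"), fetch=fetch_json):
    """Return one dict per notification, keeping only the requested fields."""
    rows = []
    for notification_id in notification_ids:
        try:
            data = fetch(API_TEMPLATE.format(notification_id))
        except (OSError, ValueError) as error:
            # Network errors (URLError is an OSError) and invalid JSON
            # (JSONDecodeError is a ValueError) skip this ID and continue.
            print(f"skipping {notification_id}: {error}")
            continue
        row = {"id": notification_id}
        # .get() yields None for an absent field instead of raising KeyError.
        row.update({name: data.get(name) for name in fields})
        rows.append(row)
    return rows
```

Calling `collect_rows([486901])` returns a list of plain dicts, and `pd.DataFrame(rows)` turns that list into a data frame. If you would rather keep using requests, pass your own function as `fetch`, for example one that returns `requests.get(url, timeout=30).json()`. A short pause between requests is a sensible courtesy when looping over many IDs.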

Scraping content from a web page without using Selenium
