I am new to web scraping, and I want to get the names of the relevant projects from this website's search results. The project names are inside h4 tags.
The site requires a username and password to view a project's details, but I only want a list of all the projects.
After looking around, I realized that to fetch the results I have to supply the query inputs from the code. The code I am using is below:
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Referer': 'https://www.devex.com/login?return_to=https%3A%2F%2Fwww.devex.com%2Ffunding%2Fr%3Freport%3Dgrant-21475%26query%255B%255D%3Dbig%2Bdata%26filter%255Bstatuses%255D%255B%255D%3Dforecast%26filter%255Bstatuses%255D%255B%255D%3Dopen%26filter%255Bupdated_since%255D%3D2019-09-02T04%253A57%253A27.714Z%26sorting%255Border%255D%3Ddesc%26sorting%255Bfield%255D%3D_score',
}

params = (
    ('report', 'grant-21475'),
    ('query[]', 'big data'),
    ('filter[statuses][]', ['forecast', 'open']),
    ('filter[updated_since]', '2019-09-02T04:57:27.714Z'),
    ('sorting[order]', 'desc'),
    ('sorting[field]', '_score'),
)

response = requests.get('https://www.devex.com/funding/r', headers=headers, params=params)
result = bs(response.text, 'html.parser')
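The extraction step would then presumably look something like the sketch below (a rough sketch, assuming the names really sit in plain h4 tags), but it returns an empty list:

# Rough sketch of the extraction step: pull the text out of every <h4> on the page.
names = [h4.get_text(strip=True) for h4 in result.find_all('h4')]
print(names)  # empty, because the project listings are not in the static HTML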
The result I get does not contain the required tags or information. Please let me know what I am missing.
Thanks.
Answer (score: 1)
The content is returned dynamically by an API call, which you can find in your browser's Network tab:
import requests
r = requests.get('https://www.devex.com/api/funding_projects?query[]=big+data&filter[statuses][]=forecast&filter[statuses][]=open&filter[updated_since]=2019-09-03T14:27:15.234Z&page[number]=1&page[size]=1000&sorting[order]=desc&sorting[field]=_score').json()
titles = [project['title'] for project in r['data']]
print(len(titles))
You can loop over the results by changing the page number and page size parameters. The first request tells you how many results there are in total; in this case I simply used a page size larger than the expected number of results.
Example loop:
import requests
import math

titles = []
page_size = 500

with requests.Session() as s:
    # First request: returns the first page and the total number of results
    r = s.get(f'https://www.devex.com/api/funding_projects?query[]=big+data&filter[statuses][]=forecast&filter[statuses][]=open&filter[updated_since]=2019-09-03T14:27:15.234Z&page[number]=1&page[size]={page_size}&sorting[order]=desc&sorting[field]=_score').json()
    total = int(r['total'])
    titles += [project['title'] for project in r['data']]
    if total > page_size:
        num_pages = math.ceil(total / page_size)
        # Fetch the remaining pages
        for page in range(2, num_pages + 1):
            r = s.get(f'https://www.devex.com/api/funding_projects?query[]=big+data&filter[statuses][]=forecast&filter[statuses][]=open&filter[updated_since]=2019-09-03T14:27:15.234Z&page[number]={page}&page[size]={page_size}&sorting[order]=desc&sorting[field]=_score').json()
            titles += [project['title'] for project in r['data']]

print(len(set(titles)))
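As a side note, the same query could be expressed through requests' params argument instead of hand-building the URL in an f-string; a minimal sketch under that assumption (same endpoint and field names as above), where only page[number] would change between requests:

import requests

# Same endpoint and fields as above; repeated keys such as filter[statuses][]
# are preserved by passing the parameters as a list of (key, value) pairs.
base_url = 'https://www.devex.com/api/funding_projects'
params = [
    ('query[]', 'big data'),
    ('filter[statuses][]', 'forecast'),
    ('filter[statuses][]', 'open'),
    ('filter[updated_since]', '2019-09-03T14:27:15.234Z'),
    ('page[number]', 1),
    ('page[size]', 500),
    ('sorting[order]', 'desc'),
    ('sorting[field]', '_score'),
]
r = requests.get(base_url, params=params).json()
titles = [project['title'] for project in r['data']]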