I have code that scrapes www.oddsportal.com:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/91.0.4472.114 Safari/537.36'}

    for url in df['URL']:
        source = requests.get(url, headers=headers)
        soup = BeautifulSoup(source.text, 'html.parser')
        main_div = soup.find("div", class_="main-menu2 main-menu-gray")
        a_tag = main_div.find_all("a")
        for i in a_tag:
            print(i['href'])
Output of df.describe():
count 1171.000000
mean 585.000000
std 338.182889
min 0.000000
25% 292.500000
50% 585.000000
75% 877.500000
max 1170.000000
Output of df.head():
| | Unnamed: 0 | URL |
|----|--------------|---------------------------------------------------------------------|
| 0 | 0 | https://www.oddsportal.com/soccer/nigeria/npfl-pre-season/results/ |
| 1 | 1 | https://www.oddsportal.com/soccer/england/efl-cup/results/ |
| 2 | 2 | https://www.oddsportal.com/soccer/europe/guadiana-cup/results/ |
| 3 | 3 | https://www.oddsportal.com/soccer/world/kings-cup-thailand/results/ |
| 4 | 4 | https://www.oddsportal.com/soccer/poland/division-2-east/results/ |
Because there are so many URLs, the website blocked me while scraping, which is to be expected given the number of requests.
How can I scrape them in batches of 50, so that I can avoid this error and save the output to a DataFrame?
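One possible approach is a minimal sketch along these lines: split the URLs into chunks of 50, pause between chunks, and collect the scraped links into rows for a DataFrame. The `pause_seconds` value, the `chunked`/`scrape_in_batches` helper names, and the assumption that a pause alone is enough to avoid the block are all mine, not from the site or any library:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/91.0.4472.114 Safari/537.36'}


def chunked(seq, size):
    """Yield successive slices of `seq` of length `size` (last one may be shorter)."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def scrape_in_batches(urls, batch_size=50, pause_seconds=60):
    """Scrape `urls` in batches, sleeping between batches; return a DataFrame.

    `pause_seconds` is a guess -- tune it to whatever the site tolerates.
    """
    rows = []
    for batch in chunked(list(urls), batch_size):
        for url in batch:
            source = requests.get(url, headers=headers)
            soup = BeautifulSoup(source.text, 'html.parser')
            main_div = soup.find("div", class_="main-menu2 main-menu-gray")
            if main_div is None:  # blocked response or changed page layout
                continue
            for a in main_div.find_all("a"):
                rows.append({'url': url, 'href': a.get('href')})
        time.sleep(pause_seconds)  # rest between batches to avoid the block
    return pd.DataFrame(rows, columns=['url', 'href'])


# result = scrape_in_batches(df['URL'])  # df as in the question
```

Note that this is still sequential and polite rather than fast; if the site blocks by request rate rather than batch count, adding a short `time.sleep` after each individual request may work better than a long pause between batches.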