I have code that scrapes www.oddsportal.com:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/91.0.4472.114 Safari/537.36'}

    for url in df['URL']:
        source = requests.get(url, headers=headers)
        soup = BeautifulSoup(source.text, 'html.parser')
        main_div = soup.find("div", class_="main-menu2 main-menu-gray")
        a_tag = main_div.find_all("a")
        for i in a_tag:
            print(i['href'])
Output of df.describe():
count 1171.000000
mean 585.000000
std 338.182889
min 0.000000
25% 292.500000
50% 585.000000
75% 877.500000
max 1170.000000
Output of df.head():
| | Unnamed: 0 | URL |
|----|--------------|---------------------------------------------------------------------|
| 0 | 0 | https://www.oddsportal.com/soccer/nigeria/npfl-pre-season/results/ |
| 1 | 1 | https://www.oddsportal.com/soccer/england/efl-cup/results/ |
| 2 | 2 | https://www.oddsportal.com/soccer/europe/guadiana-cup/results/ |
| 3 | 3 | https://www.oddsportal.com/soccer/world/kings-cup-thailand/results/ |
| 4 | 4 | https://www.oddsportal.com/soccer/poland/division-2-east/results/ |
Because there are so many URLs, the website blocked me while scraping, which is to be expected given the number of requests.
How can I scrape them in batches of 50, so that I can avoid this error and save the output to a DataFrame?
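One possible approach is a minimal sketch along these lines: split the URLs into chunks of 50, pause between chunks, and collect the scraped links into rows for a DataFrame. The `pause_seconds` value, the `chunked`/`scrape_in_batches` helper names, and the assumption that a pause alone is enough to avoid the block are all mine, not from the site or any library:

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/91.0.4472.114 Safari/537.36'}


def chunked(seq, size):
    """Yield successive slices of `seq` of length `size` (last one may be shorter)."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def scrape_in_batches(urls, batch_size=50, pause_seconds=60):
    """Scrape `urls` in batches, sleeping between batches; return a DataFrame.

    `pause_seconds` is a guess -- tune it to whatever the site tolerates.
    """
    rows = []
    for batch in chunked(list(urls), batch_size):
        for url in batch:
            source = requests.get(url, headers=headers)
            soup = BeautifulSoup(source.text, 'html.parser')
            main_div = soup.find("div", class_="main-menu2 main-menu-gray")
            if main_div is None:  # blocked response or changed page layout
                continue
            for a in main_div.find_all("a"):
                rows.append({'url': url, 'href': a.get('href')})
        time.sleep(pause_seconds)  # rest between batches to avoid the block
    return pd.DataFrame(rows, columns=['url', 'href'])


# result = scrape_in_batches(df['URL'])  # df as in the question
```

Note that this is still sequential and polite rather than fast; if the site blocks by request rate rather than batch count, adding a short `time.sleep` after each individual request may work better than a long pause between batches.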