How do I get a website to consistently return content from a GET request when its responses are inconsistent?

Time: 2019-05-17 18:47:22

Tags: beautifulsoup

I posted a similar question earlier, but I think this is a better-refined one.

I am trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0

My code randomly raises errors when I send a GET request to the URL. After debugging, I saw the following happen. A GET request is sent to a URL like this one (example URL; it can happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400

The page then shows "No matching transactions found". However, if you refresh the page, the content loads. I am using BeautifulSoup and Selenium, and I have put sleep statements into my code hoping that would help, but to no avail. Is this a problem on the website's end? It makes no sense to me that one GET request returns nothing while the exact same request returns something. Also, is there anything I can do to fix it, or is it out of my control?

Here is a sample of my code:

import time

from bs4 import BeautifulSoup
from selenium import webdriver


def scrapeWebsite(url, start, stop):
    driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
    print(start, stop)

    madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}

    # Pages are offset in steps of 25; the full site is range(0, 214025, 25)
    for i in range(start, stop, 25):
        print("Current Page: " + str(i))
        currUrl = url + str(i)
        # (Earlier attempts used requests / urllib2 directly instead of Selenium.)

        driver.get(currUrl)
        # Sleep so the dynamically loaded page has time to render
        time.sleep(1)
        soupPage = BeautifulSoup(driver.page_source, 'html.parser')

        info = soupPage.find("table", attrs={'class': 'datatable center'})
        time.sleep(1)
        extractedInfo = info.findAll("td")

The error occurs on the last line. findAll fails because info is None when the content is missing (meaning the GET request returned nothing).
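
For context, a minimal guard around that failure point could look like the sketch below. It is only an illustration under my own assumptions (the helper name get_results_table, the retry count, and the sleep lengths are not from the original code): reload the page whenever the table is missing instead of calling findAll on None.

import time

from bs4 import BeautifulSoup


def get_results_table(driver, url, attempts=3):
    # Hypothetical helper: load the page and return the results table,
    # retrying a few times when the table is missing.
    for _ in range(attempts):
        driver.get(url)
        time.sleep(1)  # give the dynamically loaded page time to render
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        table = soup.find("table", attrs={'class': 'datatable center'})
        if table is not None:
            return table  # safe to call findAll / find_all on this
        time.sleep(2)  # back off before requesting the same page again
    return None  # caller decides how to handle repeated failures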

1 Answer:

Answer 0 (score: 0):

I did a workaround to scrape all the pages using try/except.

The request loop may simply be so fast that the page cannot keep up with it.

See the example below; it works like a charm:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'


def scrape(start=0, stop=214525):
    for page in range(start, stop, 25):
        current_url = URL % page

        print('scrape: current %s' % page)
        while True:
            try:
                response = requests.request('GET', current_url)
                if response.ok:
                    soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')

                    table = soup.find("table", attrs={'class': 'datatable center'})
                    trs = table.find_all('tr')

                    # Keep the header row only on the first page
                    slice_pos = 1 if page > 0 else 0
                    for tr in trs[slice_pos:]:
                        yield tr.find_all('td')

                    break
            except Exception as exception:
                # A missing table (the "No matching transactions found" page) raises an
                # AttributeError here, so the same page is simply requested again.
                print(exception)


for columns in scrape():
    values = [column.text.strip() for column in columns]
    # Continue your own processing here ...
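
If the page really is rejecting requests that arrive too quickly, a further tweak (my own suggestion, not part of the answer above) is to pause before each retry. A sketch under that assumption, with an arbitrarily chosen delay:

import time

import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'


def scrape_with_backoff(start=0, stop=214525, delay=2):
    # Variant of scrape() that sleeps between failed attempts; delay is an assumed value.
    for page in range(start, stop, 25):
        current_url = URL % page
        while True:
            try:
                response = requests.get(current_url)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, features='html.parser')
                table = soup.find("table", attrs={'class': 'datatable center'})
                if table is None:
                    raise ValueError('results table missing on page %s' % page)
                slice_pos = 1 if page > 0 else 0  # keep the header row only once
                for tr in table.find_all('tr')[slice_pos:]:
                    yield tr.find_all('td')
                break
            except Exception as exception:
                print(exception)
                time.sleep(delay)  # back off so the retries do not hammer the site


for columns in scrape_with_backoff():
    values = [column.text.strip() for column in columns]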