Chromedriver keeps changing the time zone while scraping

Time: 2019-03-04 06:23:36

Tags: python web-scraping selenium-chromedriver

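Below is the beginning of my Python code, which successfully scrapes all the table information from the website and exports it to a CSV file. The only problem I have is that Chromedriver keeps changing the time zone in the top right corner, which ends up skewing my output by assigning incorrect dates to some games. I tried looking in the page source for a link or tag that would let me click on "GMT-8 Pacific Time", but unfortunately I couldn't find anything. Frustratingly, when I copy and paste the URL into my own browser, Chrome immediately switches back to Pacific time. Does anyone know how to get around this time zone issue when scraping with Chromedriver? Thanks in advance!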

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd

# set scope and create empty lists
year = 2018
lastpage = 50
Date = []
Time = []
Team1 = []
Team2 = []
Score = []
All_ML = []
Team1_ML = []
Team2_ML = []

driver = webdriver.Chrome()
driver.get('http://www.oddsportal.com/')
driver.execute_script('op.selectTimeZone(6);')

# set up for loop to loop through all pages
for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/"
    driver.get(url)


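    # wait until the JavaScript-rendered table loads, then grab its text
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text
    print(odds)

    # the driver is shared by every iteration, so quit it only after the
    # last page; closing it earlier breaks driver.get() on the next page
    if x == lastpage:
        driver.quit()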

    # reformat resulting text for consistency
    odds = re.sub("[0-9] - ", str(year)[-1] + " -- ", odds)
    odds = re.sub(" - ", "\nteam2", odds)

    # split text by line
    odds = odds.split("\n")

    counter = 1

    # set up loop to classify each line of text
    for line in odds:

        # if a game was abandoned or cancelled, set score to N/A
        if re.match(r".*( {1})[a-zA-Z]*\.$", line):
            Score.append("N/A")

        # if date format is matched, add to date list and reset counter
        if re.match("(.{2} .{3} .{4}.*)", line):
            currdate = line[:11]
            Date.append(currdate)
            counter = 1

        # if time format is matched at beginning of string, add time to list, add team1 to list, check if there was a new date for this game. if not, add current date from previous game
        elif re.match('(.{2}:.{2})', line):
            Time.append(line[:5])
            Team1.append(line[6:])
            if counter > 1:
                Date.append(currdate)
            counter += 1

        # if its a team2 line, add to team2 list. if score is on the same line, add to score list
        elif re.match("team2.*", line):
            if re.match(".*:.*", line):
                Score.append(re.sub("[a-zA-Z]* *", "", line[-5:]))
                Team2.append(re.sub(" {1}[0-9]*:[0-9]*", "", line[5:len(line)]))
            else:
                Team2.append(re.sub(r" {1}[a-zA-Z]*\.", "", line[5:]))

        # if score is on its own line, add to score list
        elif re.match(".*:.*", line):
            Score.append(re.sub(" ", "", line))

        # add all moneylines to a list
        elif re.match(r"[+\-.*]", line):
            All_ML.append(line)

    # add odd money lines to list1, even moneylines to list 2
    Team1_ML = All_ML[0::2]
    Team2_ML = All_ML[1::2]

# create dataframe with all lists
df = pd.DataFrame(
    {'Date': Date,
     'Time': Time,
     'Team1': Team1,
     'Team2': Team2,
     'Score': Score,
     'Team1_ML': Team1_ML,
     'Team2_ML': Team2_ML})

# save
df.to_csv('odds2018.csv')

2 Answers:

Answer 0 (score: 1)

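To flesh out pguardiario's note: if you inspect the buttons in the top right corner with Chrome DevTools, each one links to https://www.oddsportal.com/set-timezone/n/, where n is a time zone code. Clicking one actually calls the function op.selectTimeZone(n), which changes the time zone shown on the page. You can experiment by entering op.selectTimeZone(n) in the Chrome console.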

If that works for you, you can simulate the console JavaScript call from Selenium as follows, where n is the code of the time zone you want:

driver.execute_script('op.selectTimeZone(n);')
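In the line above, n is a placeholder: passed literally, the browser would look for a JavaScript variable named n. A minimal sketch of filling in the code from Python follows; the helper name timezone_script is hypothetical and not part of Selenium or of the site.

```python
def timezone_script(code):
    """Return the JavaScript snippet that selects the given time zone code.

    Hypothetical helper: it only formats the op.selectTimeZone(n) call
    described in this answer. int() rejects anything that is not a number,
    so arbitrary text cannot end up in the script.
    """
    return f"op.selectTimeZone({int(code)});"
```

With a live driver this would be used as driver.execute_script(timezone_script(6)), which is the same call as the hard-coded version shown in the example below.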

You can add this call after each driver initialization to force the time zone, for example:

for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/"

    driver = webdriver.Chrome()
    driver.get(url)

    # Set timezone
    driver.execute_script('op.selectTimeZone(6);')

    # wait until the JavaScript-rendered table loads, then grab its text
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text

Note that you may need to add a wait, because selecting the time zone adds an extra step that has to finish before the table is read.
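One way to sketch such a wait is to poll document.readyState after the time zone call. This is an assumption about what is worth waiting for, not something confirmed for this site, and the helper name wait_for_page_ready is hypothetical. It is written in plain Python so that it works with any object exposing Selenium's execute_script method; with Selenium itself, WebDriverWait(driver, 10).until(...) plays the same role.

```python
import time


def wait_for_page_ready(driver, timeout=10.0, poll=0.25):
    """Poll document.readyState until the page reports 'complete'.

    `driver` is any object with Selenium's execute_script(script) method.
    Returns True once the page is ready, or False if `timeout` seconds
    pass first. Assumption: a finished reload is the right signal here.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Calling wait_for_page_ready(driver) right after driver.execute_script('op.selectTimeZone(6);') would pause until the browser reports the page as loaded. If the reload has not started yet when the first check runs, the function can return too early, so the existing wait on tournamentTable is still needed.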

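Also, unless you plan to parallelize the for loop, there is really no need to re-create the driver on every iteration. It will probably run faster if you move the driver initialization and shutdown outside the loop.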

Edit:

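So it seems that if you go directly to the results page, you cannot set the time zone without triggering a page reload. You may need to move the setup and the initial load out of the loop, for example: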

driver = webdriver.Chrome()
driver.get('http://www.oddsportal.com/')
# Trigger the JS on-click handler of the time zone selection button
driver.execute_script("op.showHideTimeZone();ElementSelect.expand( 'user-header-timezone' , 'user-header-timezone-expander' , null , function(){op.hideTimeZone()} );this.blur();")
driver.execute_script('op.selectTimeZone(6);')

for x in range(1, lastpage + 1):
    url = "http://www.oddsportal.com/baseball/usa/mlb-" + str(year) + "/results/#/page/" + str(x) + "/"

    driver.get(url)

    # wait until the JavaScript-rendered table loads, then grab its text
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.XPATH, '//*[@id="tournamentTable"]')))
    odds = element.text
    print(odds)

# close the Chrome window once all pages have been scraped
driver.close()

Answer 1 (score: 0)

It looks like you can set it like this:

driver.get("https://www.oddsportal.com/set-timezone/6/")
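To show where that request would sit relative to the question's page loop, here is a minimal sketch. The names BASE_URL, results_url, scrape_pages and handle_page are hypothetical. The https scheme is taken from this answer (the question's code uses http), and whether the setting persists across the later page loads is assumed rather than verified.

```python
BASE_URL = "https://www.oddsportal.com"


def results_url(year, page):
    """Build the results-page URL used in the question's loop."""
    return f"{BASE_URL}/baseball/usa/mlb-{year}/results/#/page/{page}/"


def scrape_pages(driver, year, last_page, handle_page, tz_code=6):
    """Set the time zone once, then visit each results page in turn.

    `driver` is a Selenium WebDriver (anything with a get(url) method).
    `handle_page(driver)` is called after each results page is requested,
    and its return values are collected into a list.
    """
    # Request the set-timezone URL before any results page is loaded.
    driver.get(f"{BASE_URL}/set-timezone/{tz_code}/")
    results = []
    for page in range(1, last_page + 1):
        driver.get(results_url(year, page))
        results.append(handle_page(driver))
    return results
```

With a real driver, handle_page would contain the WebDriverWait block from the question and return element.text for each page.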