Scraping NFL money line odds

Time: 2020-06-11 17:31:03

Tags: python string list dataframe web-scraping

I've been working on this for two weeks and I think I'm close, but I could use some help. I'm trying to scrape the odds in order to get the games and money lines.

browser = webdriver.Chrome()
browser.get("https://www.oddsportal.com/american-football/usa/nfl-2017-2018/results/")
games = browser.find_element_by_class_name('table-main').text

This returns a single string in which the table rows are all separated by '\n10' and the individual table entries by '\n':

 "American Football\n»\n USA\n»\nNFL 2018/2019\n03 Feb 2019 - Play Offs 1 2 B's\n22:30 Los Angeles Rams - New England Patriots 3:13\n+110\n-127\n10\n27 Jan 2019 - All Stars 1 2 B's\n19:00 NFC - AFC 7:26\n-106\n-118\n10\n20 Jan 2019 - Play Offs 1 2 B's\n22:40 Kansas City Chiefs - New England Patriots 31:37 OT\n-172\n+148\n10\n19:05 New Orleans Saints - Los Angeles Rams 23:26 OT\n-164\n+140\n10\n13 Jan 2019 - Play Offs 1 2 B's\n20:40 New Orleans Saints - Philadelphia Eagles 20:14\n-385\n+312\n10\n17:05 New England Patriots - Los Angeles Chargers 41:28\n-196\n+166\n10\n00:15 Los Angeles Rams - Dallas Cowboys 30:22\n-345\n+281\n10\n12 Jan 2019 - Play Offs 1 2 B's\n20:35 Kansas City Chiefs - Indianapolis Colts 31:13\n-208\n+175\n10\n06 Jan 2019 - Play Offs 1 2 B's\n20:40 Chicago Bears - Philadelphia Eagles 15:16\n-286\n+231\n10\n17:05 Baltimore Ravens - Los Angeles Chargers 17:23\n-149\n+129\n10\n00:15 Dallas Cowboys - Seattle Seahawks 24:22\n-149\n+129\n10\n05 Jan 2019 - Play Offs 1 2 B's\n20:35 Houston Texans - Indianapolis Colts 7:21\n-127\n+108\n10\n31 Dec 2018 1 2 B's\n00:20 Tennessee Titans - Indianapolis Colts 17:33\n+194\n-233\n10

I can get closer by doing the following, but I still don't know how to reach the end goal of a DataFrame with four columns: game date, teams, money_line1, money_line2.

import re

game_list1 = re.split('\n10', games)

which returns:

["American Football\n»\n USA\n»\nNFL 2017/2018\n04 Feb 2018 - Play Offs 1 2 B's\n22:30 New England Patriots - Philadelphia Eagles 33:41\n-196\n+173",
 "\n28 Jan 2018 - All Stars 1 2 B's\n19:00 AFC - NFC 24:23\n+124\n-147",
 "\n21 Jan 2018 - Play Offs 1 2 B's\n22:40 Philadelphia Eagles - Minnesota Vikings 38:7\n+129\n-147",
 '\n19:05 New England Patriots - Jacksonville Jaguars 24:20\n-333\n+279',
 "\n14 Jan 2018 - Play Offs 1 2 B's\n20:40 Minnesota Vikings - New Orleans Saints 29:24\n-233\n+197",
 '\n17:05 Pittsburgh Steelers - Jacksonville Jaguars 42:45\n-303\n+254',
 '\n00:15 New England Patriots - Tennessee Titans 35:14\n-909\n+608',

So I think I'm getting closer, but I don't know where to go from here, since the pattern changes when different dates have different numbers of games.
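One way forward (a sketch, assuming the `'\n10'` split above is reliable) is to walk each chunk line by line, carrying the most recent date header forward and recognizing game lines by their leading kickoff time; the regex patterns and team names below are taken from the sample output above:

```python
import re

# Hedged sketch: turn the chunks produced by re.split('\n10', games) into
# (date, game, money_line1, money_line2) tuples. A chunk may open with a
# date header such as "04 Feb 2018 - Play Offs 1 2 B's"; chunks without
# one belong to the most recently seen date.
DATE_RE = re.compile(r"^(\d{2} \w{3} \d{4})")
GAME_RE = re.compile(r"^\d{2}:\d{2} (.+?) \d+:\d+(?: OT)?$")
ML_RE = re.compile(r"^[+-]\d+$")

def parse_chunks(chunks):
    rows, date = [], None
    for chunk in chunks:
        lines = [ln.strip() for ln in chunk.split("\n") if ln.strip()]
        for i, line in enumerate(lines):
            m = DATE_RE.match(line)
            if m:
                date = m.group(1)  # carry forward until the next header
                continue
            g = GAME_RE.match(line)
            # a game line is immediately followed by its two money lines
            if (g and i + 2 < len(lines)
                    and ML_RE.match(lines[i + 1])
                    and ML_RE.match(lines[i + 2])):
                rows.append((date, g.group(1), lines[i + 1], lines[i + 2]))
    return rows
```

This also drops the "American Football / USA / NFL" breadcrumb lines at the start of the first chunk, since they match neither pattern.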

The DataFrame would look something like this, but without the scores:

    date         game                money_line1   money_line2
0   04 Feb 2018  Patriots - Eagles   -196          +173

Before this, I tried iterating over it. Running this returns one row, since it looks like every unique element I'm after has the class name odd.deactivate:

browser = webdriver.Chrome()
browser.get("https://www.oddsportal.com/american-football/usa/nfl-2018-2019/results/")
time.sleep(2)
tab_main = browser.find_element_by_class_name('odd.deactivate').text
tab_main

'22:30 Los Angeles Rams - New England Patriots 3:13\n+110\n-127\n10'

But trying to iterate over it with elements and XPath didn't work. Here's my current attempt:

browser = webdriver.Chrome()
browser.get("https://www.oddsportal.com/american-football/usa/nfl-2018-2019/results/")
time.sleep(2)
tab_main = browser.find_elements_by_class_name('odd.deactivate')
game_list = []
for line in tab_main:
    game = line.find_element_by_xpath('/tbody/tr[4]/td[2]')
    ml1 = line.find_element_by_xpath('/tbody/tr[4]/td[4]')
    ml2 = line.find_element_by_xpath('/tbody/tr[4]/td[6]')
    game_row = (game, ml1, ml2)
    game_list.append(game_row)

This produces the following error:

---------------------------------------------------------------------------
NoSuchElementException                    Traceback (most recent call last)
<ipython-input-646-e1f07f8ecd68> in <module>
      5 game_list = []
      6 for line in tab_main:
----> 7     game = line.find_element_by_xpath('/tbody/tr[4]/td[2]')
      8     ml1 = line.find_element_by_xpath('/tbody/tr[4]/td[4]')
      9     ml2 = line.find_element_by_xpath('/tbody/tr[4]/td[6]')

~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py in find_element_by_xpath(self, xpath)
    349             element = element.find_element_by_xpath('//div/td[1]')
    350         """
--> 351         return self.find_element(by=By.XPATH, value=xpath)
    352 
    353     def find_elements_by_xpath(self, xpath):

~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py in find_element(self, by, value)
    657 
    658         return self._execute(Command.FIND_CHILD_ELEMENT,
--> 659                              {"using": by, "value": value})['value']
    660 
    661     def find_elements(self, by=By.ID, value=None):

~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webelement.py in _execute(self, command, params)
    631             params = {}
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 
    635     def find_element(self, by=By.ID, value=None):

~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~/anaconda3/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/tbody/tr[4]/td[2]"}
  (Session info: chrome=81.0.4044.138)

1 answer:

Answer 0 (score: 1)

Since the page has a <table> tag, I would use pandas' .read_html() function, because it is built specifically to parse table tags. The tricky part is that the data has multiple headers interleaved with the data rows, so you just need to figure out how to iterate through those.
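For reference, `read_html` returns a list of DataFrames, one per `<table>` element in the markup (it requires an HTML parser such as lxml to be installed); a minimal, made-up example:

```python
from io import StringIO

import pandas as pd

# read_html parses every <table> in the markup and returns a list of frames
html = """
<table>
  <tr><th>game</th><th>ml1</th><th>ml2</th></tr>
  <tr><td>Rams - Patriots</td><td>+110</td><td>-127</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html))
df = tables[0]  # the first (and here, only) table
```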



from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome()
browser.get("https://www.oddsportal.com/american-football/usa/nfl-2017-2018/results/")

# read_html returns a list of DataFrames, one per <table>; the odds table is first
df = pd.read_html(browser.page_source, header=0)[0]

dateList = []
gameList = []
money_line1List = []
money_line2List = []

for row in df.itertuples():
    if not isinstance(row[1], str):
        continue  # skip filler/NaN rows
    elif ':' not in row[1]:
        # header rows like "04 Feb 2018 - Play Offs" carry the date
        date = row[1].split('-')[0]
        continue
    # game rows start with a kickoff time like "22:30"
    dateList.append(date)
    gameList.append(row[2])
    money_line1List.append(row[5])
    money_line2List.append(row[6])

result = pd.DataFrame({'date': dateList,
                       'game': gameList,
                       'money_line1': money_line1List,
                       'money_line2': money_line2List})
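As a self-contained check of the header-vs-game-row logic, the same loop can be run on a small hand-built frame shaped like what `read_html` returns (the values below are invented for illustration):

```python
import pandas as pd

# Toy frame mimicking the scraped table: a filler row, a date-header row
# (no ':' in the first column), then a game row starting with a time.
raw = pd.DataFrame([
    [None, None, None, None, None, None],
    ["04 Feb 2018 - Play Offs", None, None, None, None, None],
    ["22:30", "Patriots - Eagles", "33:41", None, "-196", "+173"],
])

rows = []
date = None
for row in raw.itertuples():
    if not isinstance(row[1], str):
        continue                                 # filler rows
    elif ':' not in row[1]:
        date = row[1].split('-')[0].strip()      # date-header rows
        continue
    rows.append((date, row[2], row[5], row[6]))  # game rows
```

After the loop, `rows` holds one tuple per game, with the date carried down from the preceding header row.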