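I need to extract odds data from the HTML table on this site: http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1
I want to extract the odds for every match. The problem is that each match spans two rows (opening and closing).
I wrote the code below, but it returns an empty DataFrame: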
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import pandas as pd
import copy
import numpy as np
import time
results = []
d = webdriver.Chrome(executable_path = r'C:\chromedriver.exe')
u = "http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1"
d.get(u)
WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#main > div.pl_right > table")))
soup = bs(d.page_source, 'lxml')
rows = soup.select('#main > div.pl_right > table')
headers = ['Comp', 'Time', 'Match' ,'Odds', 'H','D', 'A', 'Res']
i = 1
for row in rows[1:]:
    cols = [td.text for td in row.select('td')]
    if (i % 2 == 1):
        record = {'Comp' : cols[0],
                  'Time' : cols[1],
                  'Match' : ' v '.join([cols[2], cols[10]]),
                  'Odds' : 'op',
                  'H' : cols[3],
                  'D' : cols[4],
                  'A' : cols[5],
                  'Res' : cols[11]}
    else:
        record['Odds'] = 'cl'
        record['H'] = cols[0]
        record['D'] = cols[1]
        record['A'] = cols[2]
    results.append(copy.deepcopy(record))
    i+=1
df = pd.DataFrame(results, columns = headers)
d.quit()
Answer 0 (score: 0)
This script extracts the table and puts the information into a list:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
all_data = []
for tr in soup.select('.schedule tr[id^="tr"]')[::2]:
    row1 = [td.get_text(strip=True) for td in tr.select('td')]
    row2 = [td.get_text(strip=True) for td in tr.find_next('tr').select('td')]
    # extract the time from the <script> tag:
    row1[1] = re.findall(r'\d+,\d+,\d+(?=\))', tr.select('td')[1].script.contents[0])[0]
    row1 = row1[:3] + row1[3:10] + row2 + row1[10:-1]
    all_data.append(row1)
# print on screen:
from pprint import pprint
pprint(all_data, width=250)
Prints:
[['KFAC', '04,00,00', 'Gyeongju Citizen', '1.94', '3.48', '3.57', '47.60%', '26.54%', '25.87%', '92.34%', '1.83', '3.53', '3.94', '50.43%', '26.14%', '23.42%', '92.29%', 'Pyeongtaek Citizen', '0-0'],
['KFAC', '05,00,00', 'Paju Citizen FC', '2.87', '3.09', '2.43', '32.16%', '29.87%', '37.98%', '92.30%', '2.89', '3.06', '2.44', '31.96%', '30.18%', '37.85%', '92.36%', 'Gimpo FC', '2-2'],
['KFAC', '06,00,00', 'FC Anyang', '1.26', '5.18', '9.07', '72.35%', '17.60%', '10.05%', '91.16%', '1.33', '4.65', '7.67', '68.52%', '19.60%', '11.88%', '91.13%', 'Goyang FC', '2-0'],
['KFAC', '06,00,00', 'Jeju United', '1.09', '8.40', '20.12', '84.46%', '10.96%', '4.58%', '92.06%', '1.09', '8.40', '19.82', '84.41%', '10.95%', '4.64%', '92.01%', 'Songwol', '4-0'],
['KFAC', '06,00,00', 'Jeonnam Dragons', '1.14', '6.76', '14.22', '80.08%', '13.50%', '6.42%', '91.29%', '1.15', '6.51', '13.68', '79.32%', '14.01%', '6.67%', '91.22%', 'Chungju Citizen', '2-0'],
['KFAC', '07,00,00', 'Hwaseong FC', '2.71', '3.14', '2.53', '34.08%', '29.41%', '36.51%', '92.36%', '2.85', '2.98', '2.52', '32.39%', '30.98%', '36.63%', '92.31%', 'Daejeon Korail', '2-2'],
['KFAC', '07,00,00', 'Suwon City', '1.13', '7.00', '14.95', '80.84%', '13.05%', '6.11%', '91.35%', '1.13', '7.09', '15.16', '81.04%', '12.92%', '6.04%', '91.58%', 'Hyochang FC', '10-0'],
['KOR D1', '07,30,00', 'FC Seoul', '4.24', '3.39', '1.95', '22.60%', '28.26%', '49.14%', '95.82%', '5.03', '3.65', '1.76', '19.10%', '26.32%', '54.58%', '96.07%', 'Jeonbuk Hyundai Motors', '1-4'],
['INT CF', '08,00,00', 'Bohemians1905 B', '1.96', '4.27', '3.27', '48.58%', '22.30%', '29.12%', '95.22%', '', '', '', '', '', '', '', 'Slavia Prague B', '0-5'],
['INT CF', '08,00,00', 'Sepsi', '2.03', '3.24', '3.21', '44.27%', '27.74%', '28.00%', '89.87%', '1.68', '3.70', '3.98', '53.30%', '24.20%', '22.50%', '89.54%', 'Chindia Targoviste', '2-1'],
['KFAC', '08,00,00', 'Gyeongju KHNP', '1.23', '5.26', '10.32', '73.91%', '17.28%', '8.81%', '90.91%', '1.17', '6.07', '13.34', '78.10%', '15.05%', '6.85%', '91.38%', 'SMC Engineering', '4-0'],
['VIE U19', '08,00,00', 'Becamex Binh Duong U19', '1.15', '6.08', '10.27', '76.86%', '14.54%', '8.61%', '88.39%', '1.17', '5.78', '9.71', '75.59%', '15.30%', '9.11%', '88.44%', 'Can Tho U19', '6-0'],
['VIE U19', '08,00,00', 'Dong Tam Long An U19', '4.22', '3.64', '1.65', '21.20%', '24.58%', '54.22%', '89.46%', '4.48', '3.63', '1.61', '19.93%', '24.60%', '55.47%', '89.29%', 'Sai Gon FC U19', '1-2'],
['INT CF', '08,15,00', 'Admira Praha', '2.34', '3.99', '2.37', '38.85%', '22.79%', '38.36%', '90.91%', '2.27', '4.09', '2.41', '40.05%', '22.23%', '37.72%', '90.91%', 'Loko Vltavin', '3-1'],
... and so on.
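Since the question asks for a DataFrame, the list can be loaded into pandas afterwards. The following is only a minimal sketch: the column names (and the op_/cl_ prefixes) are my own assumption and do not come from the site, and the two sample rows are copied from the printed output above.

```python
import pandas as pd

# Two rows copied from the printed output above.
all_data = [
    ['KFAC', '04,00,00', 'Gyeongju Citizen', '1.94', '3.48', '3.57', '47.60%', '26.54%', '25.87%', '92.34%',
     '1.83', '3.53', '3.94', '50.43%', '26.14%', '23.42%', '92.29%', 'Pyeongtaek Citizen', '0-0'],
    ['INT CF', '08,00,00', 'Bohemians1905 B', '1.96', '4.27', '3.27', '48.58%', '22.30%', '29.12%', '95.22%',
     '', '', '', '', '', '', '', 'Slavia Prague B', '0-5'],
]

# Hypothetical column names: seven opening fields, then seven closing fields.
odds_fields = ['HW', 'D', 'AW', 'HWR', 'DR', 'AWR', 'Return']
columns = (['League', 'Time', 'Home']
           + ['op_' + f for f in odds_fields]
           + ['cl_' + f for f in odds_fields]
           + ['Away', 'Score'])

df = pd.DataFrame(all_data, columns=columns)

# The odds arrive as strings; empty strings (no closing odds) become NaN.
for col in ['op_HW', 'op_D', 'op_AW', 'cl_HW', 'cl_D', 'cl_AW']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df[['Home', 'Away', 'op_HW', 'cl_HW', 'Score']])
```

Keeping opening and closing odds side by side gives one row per match, which avoids the two-rows-per-match problem entirely.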
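Answer 1 (score: 0)
That is a lot of work when pandas can parse the table for you with .read_html(). Under the hood it relies on lxml or BeautifulSoup to do the parsing.
I am also assuming that the opening odds are the first row of each pair and the closing odds the second, so you only need to slice by even/odd index values.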
import pandas as pd
import requests
url = 'http://data.nowgoal.com/1x2/Companyhistory.aspx?id=177&company=Pinnacle&matchdate=2020-06-06&ft=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
response = requests.get(url, headers=headers)
df = pd.read_html(response.text, header=0)[0]
evenRows = list(df.index)[::2]
oddRows = list(df.index)[1::2]
open_df = df.take(evenRows)
close_df = df.take(oddRows)
print(open_df.head(10).to_string())
Output:
    League                             Time              Home    HW     D     AW     HWR      DR     AWR  Return                    Away  Score
 0    KFAC  showtime(2020,06-1,06,04,00,00)  Gyeongju Citizen  1.94  3.48   3.57  47.60%  26.54%  25.87%  92.34%      Pyeongtaek Citizen    0-0
 2    KFAC  showtime(2020,06-1,06,05,00,00)   Paju Citizen FC  2.87  3.09   2.43  32.16%  29.87%  37.98%  92.30%                Gimpo FC    2-2
 4    KFAC  showtime(2020,06-1,06,06,00,00)         FC Anyang  1.26  5.18   9.07  72.35%  17.60%  10.05%  91.16%               Goyang FC    2-0
 6    KFAC  showtime(2020,06-1,06,06,00,00)       Jeju United  1.09  8.40  20.12  84.46%  10.96%   4.58%  92.06%                 Songwol    4-0
 8    KFAC  showtime(2020,06-1,06,06,00,00)   Jeonnam Dragons  1.14  6.76  14.22  80.08%  13.50%   6.42%  91.29%         Chungju Citizen    2-0
10    KFAC  showtime(2020,06-1,06,07,00,00)       Hwaseong FC  2.71  3.14   2.53  34.08%  29.41%  36.51%  92.36%          Daejeon Korail    2-2
12    KFAC  showtime(2020,06-1,06,07,00,00)        Suwon City  1.13  7.00  14.95  80.84%  13.05%   6.11%  91.35%             Hyochang FC   10-0
14  KOR D1  showtime(2020,06-1,06,07,30,00)          FC Seoul  4.24  3.39   1.95  22.60%  28.26%  49.14%  95.82%  Jeonbuk Hyundai Motors    1-4
16  INT CF  showtime(2020,06-1,06,08,00,00)   Bohemians1905 B  1.96  4.27   3.27  48.58%  22.30%  29.12%  95.22%         Slavia Prague B    0-5
18  INT CF  showtime(2020,06-1,06,08,00,00)             Sepsi  2.03  3.24   3.21  44.27%  27.74%  28.00%  89.87%      Chindia Targoviste    2-1
....
Alternatively, it looks like you want to fill in the table and label the rows 'op' and 'cl'; that needs only a slight modification to the code.
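Tagging each row as 'op' or 'cl' could look like the minimal sketch below. It does not use the live page: the small DataFrame is a hypothetical stand-in for what pd.read_html() returns, and it assumes that rows strictly alternate opening/closing and that the closing row may be missing the match-level fields (shown here as NaN). If your parsed frame already repeats those fields, the forward fill simply changes nothing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed table: rows alternate between
# opening and closing odds; closing rows lack the match-level fields.
df = pd.DataFrame({
    "League": ["KFAC", np.nan, "KFAC", np.nan],
    "Home": ["Gyeongju Citizen", np.nan, "Paju Citizen FC", np.nan],
    "HW": [1.94, 1.83, 2.87, 2.89],
    "D": [3.48, 3.53, 3.09, 3.06],
    "AW": [3.57, 3.94, 2.43, 2.44],
})

labelled = df.copy()

# Even positions are opening rows, odd positions are closing rows.
labelled["Odds"] = np.where(np.arange(len(labelled)) % 2 == 0, "op", "cl")

# Carry the match-level fields down from each opening row to its closing row.
match_cols = ["League", "Home"]
labelled[match_cols] = labelled[match_cols].ffill()

print(labelled.to_string())
```

Labelling by position rather than by index value keeps the sketch working even if the frame's index is not a plain 0..n-1 range.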
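Answer 2 (score: -1)
There is a mistake in the bs4 CSS selector for the table. It should select the rows rather than the table itself: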
soup.select('#main > div.pl_right > table > tbody > tr')
I looked at your code and found that many cases/conditions are not handled.
For example, you are not handling the date <tr> tag.
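To see why the original selector leads to an empty DataFrame, here is a small sketch on made-up markup. The HTML is a heavily simplified, hypothetical stand-in for the real page; only the two selectors are taken from this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the page markup.
html = """
<div id="main"><div class="pl_right"><table><tbody>
<tr><th>League</th><th>HW</th></tr>
<tr><td>KFAC</td><td>1.94</td></tr>
<tr><td>1.83</td></tr>
</tbody></table></div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Selector from the question: matches the <table> element itself.
tables = soup.select("#main > div.pl_right > table")
# Selector from this answer: matches the <tr> elements inside it.
rows = soup.select("#main > div.pl_right > table > tbody > tr")

print(len(tables), len(tables[1:]))  # one table, so tables[1:] is empty
print(len(rows), len(rows[1:]))      # three rows, two after the header
```

With only one element in the list, the loop over rows[1:] in the question never runs, so results stays empty.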