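Unable to scrape a fantasy football table with Python

Time: 2019-12-16 22:02:11

Tags: python web-scraping beautifulsoup selenium-chromedriver

I am trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears when I open the site in a browser, but it is not there when I scrape the site.

I tried the following: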

import requests as rq
from bs4 import BeautifulSoup

fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup

The table is nowhere to be seen. Where the table should be, the output contains this instead:

<div>
                The 'Player Data' is out of date.
                <br/> <br/>
                You need to refresh the web page.
                <br/> <br/>
                Press F5 or hit <i class="fa fa-refresh"></i>
</div>

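This message appears on the site whenever the table is being updated.

I then looked at the developer tools to see whether I could find the URL the table data is retrieved from, but I had no luck. That may be because I don't know how to read the developer tools very well.

I then tried refreshing the page with Selenium, as the message above suggests: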

from selenium import webdriver
import time

chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup

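The output is the same as before. The table is displayed on the site but not in the output.

Any help would be appreciated. I have looked at similar questions on Stack Overflow, but I have not been able to find a solution.

2 answers:

Answer 0: (score: 1)

By asking for driver.page_source, you throw away any benefit you get from Selenium: the page source does not contain the table you want. The table is populated dynamically by JavaScript after the page loads. You need to retrieve the table using methods on driver rather than using BeautifulSoup. For example: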

>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3
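
The find_element_by_* helpers used above were removed in newer Selenium 4 releases, and the output above shows that reading whole rows as text can run cells together. The following is a minimal sketch of the same idea using find_element(By...) and an explicit wait, reading each cell separately. It assumes the table with id myDataTable has thead and tbody sections and that a 30-second timeout is long enough; the function names rows_to_records and scrape_table are hypothetical and not part of the original answer.

```python
def rows_to_records(headers, rows):
    """Pair each row's cell texts with the column headers."""
    return [dict(zip(headers, cells)) for cells in rows]


def scrape_table(url="http://www.fplstatistics.co.uk",
                 table_id="myDataTable", timeout=30):
    """Open the page in Chrome and return the table rows as dicts.

    Assumption: the table has <thead>/<tbody> sections; adjust the
    selectors if the page's markup differs.
    """
    # Imported lazily so rows_to_records works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until JavaScript has put at least one body cell in the table.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, f"#{table_id} tbody td")
            )
        )
        table = driver.find_element(By.ID, table_id)
        headers = [th.text.strip()
                   for th in table.find_elements(By.CSS_SELECTOR, "thead th")]
        rows = [
            [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
            for tr in table.find_elements(By.CSS_SELECTOR, "tbody tr")
        ]
        return rows_to_records(headers, rows)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling scrape_table() should return one dict per visible row, keyed by whatever header text the page shows. The explicit wait replaces a fixed time.sleep, so the script continues as soon as the table has content.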

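Answer 1: (score: 1)

Why not go straight to the source of the data? The only thing left to work out is the column names, but this gets all of the data in a single request without using Selenium: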

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}

response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        # Raw string so the regex escape \) is not treated as a string escape
        iselRowVal = re.search(r'"value":(.+?)}\);}', script.text).group(1).strip()


url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'

payload = {
'iselRow': iselRowVal,
'_': ''}


jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])

Output:

print (df.head(5).to_string())
  0               1        2  3  4    5    6      7  8    9      10     11     12  13  14              15                                                16
0            Mustafi  Arsenal  D  A  0.3  5.2  £5.2m  0  ---    110  -95.6  -95.6  -1  -1         Mustafi  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
1           Bellerín  Arsenal  D  I  0.3  5.4  £5.4m  0  ---  54024    2.6    2.6  -2  -2        Bellerin  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
2          Kolasinac  Arsenal  D  I  0.6  5.2  £5.2m  0  ---   5464  -13.9  -13.9  -2  -2       Kolasinac  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
3     Maitland-Niles  Arsenal  D  A  2.6  4.6  £4.6m  0  ---  11924  -39.0  -39.0  -2  -2  Maitland-Niles  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
4           Sokratis  Arsenal  D  S  1.5  4.9  £4.9m  0  ---  19709  -29.4  -29.4  -2  -2        Sokratis  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
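
The regular expression that pulls the iselRow value out of the page's script can be tried offline before any request is sent. The sketch below applies the same pattern to a made-up script string. The contents of sample_script and the value 123 are assumptions for illustration, not text copied from the site, and extract_isel_row is a hypothetical helper name. Unlike the loop in the answer, it returns None instead of raising when nothing matches.

```python
import re

# Same pattern as in the answer, written as a raw string.
ISEL_ROW_PATTERN = re.compile(r'"value":(.+?)}\);}')


def extract_isel_row(script_text):
    """Return the iselRow value as a stripped string, or None if absent."""
    if '"iselRow"' not in script_text:
        return None
    match = ISEL_ROW_PATTERN.search(script_text)
    return match.group(1).strip() if match else None


# Hypothetical script text; the real page's script may be shaped differently.
sample_script = (
    'function (aoData) {aoData.push({"name": "iselRow", "value": 123});}'
)

print(extract_isel_row(sample_script))   # prints 123
print(extract_isel_row("var x = 1;"))    # prints None
```

Checking for None before building the payload makes it obvious when the site's script has changed and the pattern no longer applies.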