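Scraping/identifying a table from a website that uses <div> tags

Time: 2020-06-16 02:06:05

Tags: python web-scraping beautifulsoup

I want to use BeautifulSoup to extract the dynamic table from a website (https://datagolf.org/performance-table). However, when I use soup.find() to locate the table's source, the output contains nothing. Here is the code I am using: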

import requests
from bs4 import BeautifulSoup

url = 'https://datagolf.org/performance-table'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
box = soup.find('div', {'class': 'table-div'})
print(box)

The code above outputs:

<div class="table-div">
</div>

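When I change the class to class_='table', the output is blank. Any ideas about what is going on here? Could I be requesting the wrong source?

2 answers:

Answer 0: (score: 1)

I tried Beautiful Soup, but it didn't work; it does work with Selenium. I wrote this code for it: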

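from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: the geckodriver path is passed through a Service object
driver = webdriver.Firefox(service=Service('geckodriver.exe'))
driver.get("https://datagolf.org/performance-table")
table = []  # header row first, then one list per data row

# a = driver.find_element(By.CLASS_NAME, 'table')
# print(a.text)  # this will print all of the table content

# Collect the text of every header cell
header = [cell.text for cell in driver.find_elements(By.CLASS_NAME, 'datahead')]
header.pop(5)  # drop the sixth header cell, as in the original code
table.append(header)

# Each data row's text holds one cell per line
for row in driver.find_elements(By.CLASS_NAME, 'datarow'):
    table.append(row.text.split('\n'))

print(table)  # this will print the table as a list
driver.quit()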
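
The list printed by the Selenium code above has the header row first and then one list per data row. As a rough sketch, that shape can be turned into one dictionary per row, which is easier to query. The helper name `table_to_records` and the sample values are made up for illustration; the real column names come from the page. Note that `zip` silently truncates when a row and the header differ in length, so this assumes they match.

```python
def table_to_records(table):
    """Pair each data row with the header row; table[0] is the header."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Made-up sample in the same shape as the scraped list (header first, then rows)
sample = [
    ["PLAYER", "EVENTS", "WINS"],
    ["Player A", "8", "1"],
    ["Player B", "6", "0"],
]

records = table_to_records(sample)
print(records[0]["PLAYER"], records[0]["WINS"])
```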

Answer 1: (score: 1)

The data is stored in the page in JSON format; you can use the re / json modules to parse it.

For example:

import re
import json
import requests


url = 'https://datagolf.org/performance-table'
txt = requests.get(url).text
data = json.loads(re.search(r"var reload_data = JSON\.parse\('(.*?)'", txt).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for row in data['data']['2020']['table']:
    print('{:<40} {}'.format(row['player_name'], row['wins']))

Prints:

McIlroy, Rory                            1.0
Hatton, Tyrrell                          1.0
Rahm, Jon                                0.0
Thomas, Justin                           2.0
Schauffele, Xander                       0.0
Matsuyama, Hideki                        1.0
Reed, Patrick                            1.0
Woods, Tiger                             1.0

...and so on.
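
The regex-plus-json approach can also be tried offline against a small stand-in for the page source, which makes it easier to see why `soup.find()` returns an empty div while the data is still present in a script. In the sketch below, `sample_html` and the helper `extract_reload_data` are invented for illustration; only the `var reload_data = JSON.parse('...')` pattern is taken from the answer, and the live page may differ or change.

```python
import json
import re

# Hypothetical stand-in for the page source: an empty table div plus the
# script that carries the data. The real page embeds a much larger JSON string.
sample_html = """
<div class="table-div">
</div>
<script>
var reload_data = JSON.parse('{"data": {"2020": {"table": [{"player_name": "McIlroy, Rory", "wins": 1.0}]}}}');
</script>
"""


def extract_reload_data(page_text):
    """Return the JSON embedded in the script, or None if the pattern is absent."""
    match = re.search(r"var reload_data = JSON\.parse\('(.*?)'", page_text)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_reload_data(sample_html)
rows = data['data']['2020']['table']
print(rows[0]['player_name'], rows[0]['wins'])
```

Returning None when the pattern is missing avoids the AttributeError that calling `.group(1)` on a failed search would raise, which is useful if the site changes its markup.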

Edit: the data format is as follows:

...
            {
                "amateur": 0,
                "app_raw": 0.9807287716094194,
                "app_true": 1.1416339999999998,
                "arg_raw": 0.30359835879467356,
                "arg_true": 0.35591150000000005,
                "dg_id": 10091,
                "events": 8,
                "exp_major_wins": 0.0,
                "exp_pga_wins": 1.5499999999999998,
                "flag": "NIR",
                "ott_raw": 0.699243421907403,
                "ott_true": 0.8408904999999999,
                "player_name": "McIlroy, Rory",
                "putt_raw": 0.07181996378995552,
                "putt_true": 0.16352450000000002,
                "rnds": 29,
                "sg_raw": 2.5018271707385242,
                "sg_true": 2.9106948275862066,
                "shotlink_rnds": 20.0,
                "t2g_raw": 1.983570552311496,
                "t2g_true": 2.3384359999999997,
                "tour": "PGA",
                "wins": 1.0
            },
...

You can use the keys app_true, putt_true, arg_true, and so on.
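
As a small illustration of picking out those keys, the sketch below formats one row with a chosen set of columns. The row is a subset of the McIlroy entry shown above; the helper name `format_row`, the default key order, and the two-decimal rounding are assumptions made for this example.

```python
# Subset of the row shown in the data format above
row = {
    "player_name": "McIlroy, Rory",
    "app_true": 1.1416339999999998,
    "arg_true": 0.35591150000000005,
    "ott_true": 0.8408904999999999,
    "putt_true": 0.16352450000000002,
    "sg_true": 2.9106948275862066,
}


def format_row(row, keys=("ott_true", "app_true", "arg_true", "putt_true")):
    """Build one line: the player name followed by the chosen values, 2 decimals each."""
    values = " ".join("{:>6.2f}".format(row[key]) for key in keys)
    return "{:<20} {}".format(row["player_name"], values)


print(format_row(row))
```

In the full script, the same function could be applied to every row in `data['data']['2020']['table']`.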