Question

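I'm very new to programming, so I apologize if this is simple. I have a very basic grasp of Python and have been trying to learn how to scrape the table on this site: https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings. The problem is that the table isn't set up as a traditional HTML table. It is actually built from <div> elements and appears to be populated by a script? I've had a hard time finding a similar case that has already been solved, but I'm not sure I'm searching correctly. Here is my code so far: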

import requests
from bs4 import BeautifulSoup

page = requests.get("https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings")

soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('div', attrs={'class': 'bat'})

print(table.prettify())

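I haven't gotten very far since running into this problem. If you know of a possible solution or an example I could learn from, please let me know.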

Answer 1

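This is a case where selenium comes in handy, used together with BeautifulSoup. Beyond those two, you will usually need to inspect the page's elements carefully in your browser.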

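Here I use Firefox (which requires geckodriver to be installed correctly and placed in an appropriate location), but you could just as well use Chrome or any other browser of your choice.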

from selenium import webdriver
from bs4 import BeautifulSoup
from collections import OrderedDict
import more_itertools

# open Firefox to get the data

driver = webdriver.Firefox()
driver.get('https://rotogrinders.com/grids/nfl-targets-1402017?site=draftkings')
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# extract data from BeautifulSoup object

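# each column of the grid is a div with class 'rgt-col'
player_data = soup.find_all('div', attrs={'class':'rgt-col'})
# flatten to the text of every nested div, one column after another
text = [y.text for x in player_data for y in x.descendants if y.name == 'div']

# 250 is hard-coded for the page as scraped: one header cell plus 249 player cells per column
indices_to_delete = [i for i in range(0, len(text), 250)]
keys = [text[k] for k in indices_to_delete]

# drop the header cells, split the rest into columns, then transpose into rows
new_text = [x for x in text if x not in keys]
text = list(more_itertools.sliced(new_text, 249))
new_text = list(zip(*text))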

# build the dict

players = OrderedDict()

for x in new_text:
    y = list(zip(keys, x))
    for key, val in y:
        if key == 'Player':
            players[val] = {}
            current_player = val
        else:
            players[current_player][key] = val
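
The reshaping step can be tried without a browser on a tiny list. This is only a minimal sketch, assuming the flattened cell text arrives column by column, with each column being one header followed by a fixed number of values. The helper name `columns_to_players` and the shortened sample list are hypothetical, and unlike the code above it slices by position instead of filtering out header strings by value.

```python
from collections import OrderedDict


def columns_to_players(cells, rows_per_column):
    """Turn a flat, column-major list of cell text into {player: {header: value}}.

    Assumes each column is one header cell followed by `rows_per_column`
    value cells, and that one of the columns is headed 'Player'.
    """
    step = rows_per_column + 1  # one header cell plus the value cells
    columns = OrderedDict()
    for start in range(0, len(cells), step):
        chunk = cells[start:start + step]
        columns[chunk[0]] = chunk[1:]

    players = OrderedDict()
    for i, name in enumerate(columns['Player']):
        players[name] = {
            header: values[i]
            for header, values in columns.items()
            if header != 'Player'
        }
    return players


# Shortened sample: 3 columns with 2 players each (hypothetical input layout).
sample = [
    'Player', 'DeAndre Hopkins', 'Dez Bryant',
    'Pos', 'WR', 'WR',
    'Team', 'HOU', 'DAL',
]
players_demo = columns_to_players(sample, rows_per_column=2)
print(players_demo['Dez Bryant'])  # {'Pos': 'WR', 'Team': 'DAL'}
```

Slicing by position avoids accidentally dropping a data cell whose text happens to match a header, at the cost of still needing to know the column length.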

...so when you print(players), you get a nice OrderedDict:

OrderedDict([
    ('DeAndre Hopkins', {
        'Salary': '$6200', 
        'Pos': 'WR', 
        'Opp': 'NEP', 
        'Team': 'HOU', 
        'GP': '2', 
        'Targets': '29', 
        'RzTar': '3', 
        'PoW Tar': '48.33%', 
        'Week 1': '16', 
        'Week 2': '13', 
        'Week 3': '\xa0', 
        'Week 4': '\xa0', 
        'Yards': '128', 
        'YPT': '4.41', 
        'Rec': '14', 
        'Rec Rate': '48.28%'}), 
    ('Dez Bryant', {
        'Salary': '$6800', 
        'Pos': 'WR', 
        'Opp': 'ARI', 
        'Team': 'DAL', 
        'GP': '2', 
        'Targets': '25', 
        'RzTar': '5', 
        'PoW Tar': '28.74%', 
        'Week 1': '9', 
        'Week 2': '16', 
        'Week 3': '\xa0', 
        'Week 4': '\xa0', 
        'Yards': '102', 
        'YPT': '4.08', 
        'Rec': '9', 
        'Rec Rate': '36.00%'}
     ) ... ])

...which means you can do things like this:

>>> players['DeAndre Hopkins']
{'Salary': '$6200', 'Pos': 'WR' ... }

Ta-da!
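
All of the scraped values are strings, including '$6200', '48.33%' and the '\xa0' placeholder for weeks with no data. If numbers are needed, something like the following could be a starting point. It is a rough sketch, not part of the original answer: `to_number` is a hypothetical helper, and it assumes a blank cell should become None and that a percentage should stay on the 0-100 scale.

```python
def to_number(value):
    """Best-effort conversion of a scraped cell string.

    Returns an int or float when the text looks numeric, None for a blank
    cell, and the cleaned original string otherwise.
    """
    cleaned = value.replace('\xa0', '').strip()
    if not cleaned:
        return None  # blank cell, e.g. a week with no data yet
    # drop a leading dollar sign, a trailing percent sign, thousands separators
    stripped = cleaned.lstrip('$').rstrip('%').replace(',', '')
    try:
        return int(stripped)
    except ValueError:
        pass
    try:
        return float(stripped)
    except ValueError:
        return cleaned  # non-numeric text such as 'WR' stays a string


# A few fields copied from the DeAndre Hopkins entry in the output above.
hopkins = {'Salary': '$6200', 'Pos': 'WR', 'PoW Tar': '48.33%', 'Week 3': '\xa0'}
converted = {key: to_number(val) for key, val in hopkins.items()}
print(converted)
# {'Salary': 6200, 'Pos': 'WR', 'PoW Tar': 48.33, 'Week 3': None}
```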

Web scraping a table that does not use table tags (Python)
