Web scraping a jTable with hidden columns?

Asked: 2019-01-18 19:30:37

Tags: python selenium beautifulsoup jtable screen-scraping

I'm currently trying to set up a web scraper in Python for the following page:

https://understat.com/team/Juventus/2018

Specifically, the "team players" jTable.

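I've managed to scrape the table with BeautifulSoup and selenium, but the hidden columns (accessible through the options popup) are not initialized, so they are missing from the scraped data.

Does anyone know how to change this?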

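If you navigate to the site, you will see hidden table columns such as "xGChain". I'd like all of these hidden columns to be scraped as well, but I'm having trouble getting them.

Best, Kyle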

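1 answer:

Answer 0: (score: 0)

Here you go. You could still use BeautifulSoup to iterate through the tr/td tags, but I always find pandas much easier for grabbing tables, since it does the work for you.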

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://understat.com/team/Juventus/2018'

driver = webdriver.Chrome()
driver.get(url)

# Click the Options button to open the column-selection popup
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[1]/button/i').click()

# Click the label of each column that is hidden by default
hidden = [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]
for val in hidden:
    x_path = '//*[@id="team-players"]/div[2]/div[2]/div/div[%s]/div[2]/label' % val
    driver.find_element(By.XPATH, x_path).click()

# Apply the selection
driver.find_element(By.XPATH, '//*[@id="team-players"]/div[2]/div[3]/a[2]').click()

# Parse every table in the rendered page source; the players table is the second one
tables = pd.read_html(StringIO(driver.page_source))
data = tables[1]
# rename() returns a new DataFrame, so assign the result back
data = data.rename(columns={'Unnamed: 22': 'Yellow_Cards', 'Unnamed: 23': 'Red_Cards'})

driver.quit()
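
If the list of hidden column positions needs to change, it may help to keep the XPath construction in one place. The following is only a small sketch: the helper name `hidden_column_xpaths` and its `table_id` parameter are made up for illustration, and the XPath template is copied from the loop above, so it will break if the page layout changes.

```python
def hidden_column_xpaths(indexes, table_id="team-players"):
    """Build the XPath of each column label in the options popup.

    `indexes` are the 1-based positions of the columns in the popup,
    as used in the loop of the script above.
    """
    template = '//*[@id="{table_id}"]/div[2]/div[2]/div/div[{index}]/div[2]/label'
    return [template.format(table_id=table_id, index=i) for i in indexes]


# Positions of the columns that are hidden by default (taken from the script above)
HIDDEN = [7, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24]
xpaths = hidden_column_xpaths(HIDDEN)
```

Each entry of `xpaths` can then be passed to the driver's XPath lookup and clicked, exactly as in the loop above.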

Output of the Selenium script:

print (data.columns)
Index(['№', 'Player', 'Pos', 'Apps', 'Min', 'G', 'NPG', 'A', 'Sh90', 'KP90',
       'xG', 'NPxG', 'xA', 'xGChain', 'xGBuildup', 'xG90', 'NPxG90', 'xA90',
       'xG90 + xA90', 'NPxG90 + xA90', 'xGChain90', 'xGBuildup90',
       'Yellow_Cards', 'Red_Cards'],
      dtype='object')



print (data)
       №                 Player    ...     Yellow_Cards  Red_Cards
0    1.0      Cristiano Ronaldo    ...                2          0
1    2.0        Mario Mandzukic    ...                3          0
2    3.0           Paulo Dybala    ...                1          0
3    4.0  Federico Bernardeschi    ...                2          0
4    5.0         Blaise Matuidi    ...                2          0
5    6.0      Rodrigo Bentancur    ...                5          1
6    7.0          Juan Cuadrado    ...                2          0
7    8.0       Leonardo Bonucci    ...                1          0
8    9.0         Miralem Pjanic    ...                4          0
9   10.0           Sami Khedira    ...                0          0
10  11.0      Giorgio Chiellini    ...                1          0
11  12.0          Medhi Benatia    ...                2          0
12  13.0          Douglas Costa    ...                2          1
13  14.0               Emre Can    ...                2          0
14  15.0           Mattia Perin    ...                1          0
15  16.0      Mattia De Sciglio    ...                0          0
16  17.0      Wojciech Szczesny    ...                0          0
17  18.0        Andrea Barzagli    ...                0          0
18  19.0            Alex Sandro    ...                3          0
19  20.0         Daniele Rugani    ...                1          0
20  21.0             Moise Kean    ...                0          0
21  22.0           João Cancelo    ...                2          0
22   NaN                    NaN    ...               36          2

[23 rows x 24 columns]
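
For comparison, here is roughly what walking the tr/td tags by hand looks like. This is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it runs without extra packages. The `TableRows` class and the `SAMPLE_HTML` markup are made up for illustration; in practice you would feed it the page source once the hidden columns have been enabled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup standing in for the rendered page source
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>xG</th><th>xGChain</th></tr>
  <tr><td>Player A</td><td>1.50</td><td>2.25</td></tr>
  <tr><td>Player B</td><td>0.75</td><td>1.00</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)

# First row is the header; the rest become one dict per player
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
```

All values come back as strings and every table on the page ends up in the same `rows` list, which is part of why letting pandas parse the table is less work.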