Web scraping with Selenium: loop keeps returning the first row, plus a pagination problem

Time: 2021-03-26 13:40:35

Tags: python selenium for-loop web-scraping pagination

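I am trying to scrape a JavaScript-rendered page with Selenium, but I am running into some trouble. I want to loop over all the table rows with a for loop and then extract the table data from each row. This is the site: https://datawrapper.dwcdn.net/vzezR/4/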

from selenium import webdriver
import time

url = 'https://datawrapper.dwcdn.net/vzezR/4/'

driver = webdriver.Chrome('G:/Python Projects/venv/Lib/site-packages/chromedriver.exe')
driver.get(url)

time.sleep(2)

partyData = (driver.find_elements_by_xpath('//tr'))
print(partyData)

for item in partyData:
    party = driver.find_element_by_xpath('.//td')
    party_leader = driver.find_element_by_xpath('./html/body/div/div[1]/div[2]/table/tbody//td[2]').text
    print (party, party_leader)

Expected output:

Rutte, M.
Kaag, S.
etc.

The output I get:

Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.

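In addition, I am trying to scrape all 159 pages, but the URL does not change from page to page, and nothing changes in the Network tab either. Any suggestions on how to solve this? I was thinking of using a GUI to let Python "click" through to the next page!

Let me know what you think! Thanks in advance!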

1 Answer:

Answer 0: (score: 0)

It is easier to parse the table directly, since it is embedded as JSON inside a script tag:

from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

url = 'https://datawrapper.dwcdn.net/vzezR/4/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
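scripts = [script for script in scripts if script.string is not None]
for script in scripts:
    if 'JSON.parse' in script.string:
        # Keep everything after the first "JSON.parse(" call.
        jsonStr = script.string.split('JSON.parse(', 1)[-1]
        # Trim trailing text one ')' at a time until the argument parses.
        for _ in range(50):
            jsonStr = jsonStr.rsplit(')', 1)[0]
            try:
                # The argument is a JS string literal that holds JSON, so decode twice.
                jsonData = json.loads(json.loads(jsonStr))
                break
            except (ValueError, TypeError):
                continue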
rows = []
for idx, line in enumerate(jsonData['data']['chartData'].splitlines()):
    if idx == 0:
        cols = line.split('\t')
        continue
    row = line.split('\t')
    rows.append(row)
    
df = pd.DataFrame(rows, columns=cols)
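
The two json.loads calls are needed because the argument to JSON.parse is a JavaScript string literal whose content is itself JSON. Here is a minimal sketch of that double decode on a toy script body. The variable name `chart` and the shortened data are made up for illustration, and `json.JSONDecoder.raw_decode` is swapped in for the trim-and-retry loop, which assumes the literal is double-quoted (a single-quoted JS string would not parse this way).

```python
import json

# Hypothetical stand-in for the page's <script> body; the real one is much longer.
script_text = (
    r'var chart = JSON.parse("{\"data\":{\"chartData\":'
    r'\"Partij\\tNaam\\tTotaal\\n50PLUS\\tden Haan, N.L.\\t80533\\n'
    r'50PLUS\\tBrood, R.G.\\t2581\"}}"); render(chart);'
)

MARKER = "JSON.parse("


def extract_embedded_json(text):
    """Return the object passed (as a string literal) to the first JSON.parse call."""
    start = text.index(MARKER) + len(MARKER)
    # First decode: read the JS string literal as a JSON string.
    # raw_decode stops at the end of the literal and ignores what follows it.
    literal, _end = json.JSONDecoder().raw_decode(text, start)
    # Second decode: the content of that string is the actual JSON document.
    return json.loads(literal)


data = extract_embedded_json(script_text)

# chartData is tab-separated text: the first line is the header, the rest are rows.
lines = data["data"]["chartData"].splitlines()
header = lines[0].split("\t")
rows = [line.split("\t") for line in lines[1:]]

print(header)
print(rows)
```

The toy input only illustrates the two decoding steps. In the full script the same header and rows come from the real page and go into the DataFrame.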

Output:

print(df)
                  Partij                   Naam Totaal Positie op kieslijst
0                 50PLUS         den Haan, N.L.  80533                    1
1                 50PLUS            Brood, R.G.   2581                    2
2                 50PLUS    Verkoelen, P.J.H.D.   4890                    3
3                 50PLUS          Nijkamp, M.O.    678                    4
4                 50PLUS  van Tilborg, H.C.A.M.   1446                    5
                 ...                    ...    ...                  ...
1576  Wij zijn Nederland          Schäfer, G.F.     19                    6
1577  Wij zijn Nederland           Mulder, P.J.     15                    8
1578  Wij zijn Nederland           Gilles, A.J.     14                    9
1579  Wij zijn Nederland         Hensen, Y.W.J.     72                   10
1580  Wij zijn Nederland           de Vries, D.     37                   11

[1581 rows x 4 columns]
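
As for the Selenium loop in the question: it most likely repeats the first row because every lookup is called on `driver`, so the XPath is evaluated from the document root on each iteration instead of from the current `item`. Below is a minimal sketch of a row-relative version. `extract_rows` is a hypothetical helper name, and the string "xpath" is the value of Selenium's `By.XPATH` constant, written out so the helper needs no imports. Treat it as a sketch, not a tested fix for this exact page.

```python
def extract_rows(driver):
    """Return the cell texts of every table row that contains <td> cells.

    `driver` is expected to be a Selenium WebDriver (or anything exposing the
    same find_elements(by, value) method).
    """
    rows = []
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    for tr in driver.find_elements("xpath", "//tr"):
        # Search relative to this row: call the method on `tr`, not on `driver`,
        # and start the XPath with "./" so it is anchored at the row element.
        cells = tr.find_elements("xpath", "./td")
        if cells:  # header rows only hold <th> cells, so skip them
            rows.append([cell.text for cell in cells])
    return rows
```

With a real browser this would be called as `extract_rows(driver)` after `driver.get(url)` and the wait, and the second entry of each row would be the name column. The `find_element_by_*` helpers used in the question were deprecated in Selenium 4 in favor of `find_element(by, value)` and `find_elements(by, value)`. This still only sees the rows currently rendered, so the script-tag approach remains the simpler way to get all pages at once.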