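I'm trying to scrape a JavaScript-rendered page with Selenium, but I'm running into trouble. I want to loop over every table row and then extract the cell data from each row. This is the site: https://datawrapper.dwcdn.net/vzezR/4/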
from selenium import webdriver
import time

url = 'https://datawrapper.dwcdn.net/vzezR/4/'
driver = webdriver.Chrome('G:/Python Projects/venv/Lib/site-packages/chromedriver.exe')
driver.get(url)
time.sleep(2)

partyData = driver.find_elements_by_xpath('//tr')
print(partyData)
for item in partyData:
    party = driver.find_element_by_xpath('.//td')
    party_leader = driver.find_element_by_xpath('./html/body/div/div[1]/div[2]/table/tbody//td[2]').text
    print(party, party_leader)
Expected output:
Rutte, M.
Kaag, S.
etc.
The output I get:
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
Rutte, M.
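I'm also trying to scrape all 159 pages, but the URL doesn't change between pages and nothing changes in the Network tab either. Any suggestions on how to handle this? I'm thinking about using GUI automation to have Python "click" the next-page button!
Let me know what you think! Thanks in advance!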
Answer 0 (score: 0)
It would be easier to parse the table from the JSON embedded in a script tag, which also avoids clicking through the pages:
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

url = 'https://datawrapper.dwcdn.net/vzezR/4/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Keep only the script tags that have inline content
scripts = [script for script in soup.find_all('script') if script.string is not None]

for script in scripts:
    if 'JSON.parse' in script.string:
        # Everything after the first "JSON.parse(" starts with the string literal we want
        jsonStr = script.string.split('JSON.parse(', 1)[-1]
        count = 1
        # Trim from the right, one ")" at a time, until what remains decodes cleanly
        while count < 50:
            try:
                jsonStr = jsonStr.rsplit(')', 1)[0]
                # Inner decode unwraps the JS string literal, outer decode parses the JSON text
                jsonData = json.loads(json.loads(jsonStr))
                break
            except (ValueError, TypeError):
                count += 1

# chartData is tab-separated text whose first line holds the column names
rows = []
for idx, line in enumerate(jsonData['data']['chartData'].splitlines()):
    if idx == 0:
        cols = line.split('\t')
        continue
    rows.append(line.split('\t'))

df = pd.DataFrame(rows, columns=cols)
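The trimming loop above finds the end of the `JSON.parse(...)` argument by trial and error. As an alternative sketch (not the method used in this answer), `json.JSONDecoder.raw_decode` can read the string literal that follows `JSON.parse(` directly and ignore whatever comes after it. This assumes the argument is a double-quoted literal that is also valid JSON, which the script above relies on as well. `SAMPLE_SCRIPT`, `extract_json_parse_payload` and `chart_rows` are hypothetical names, and the sample reuses two rows of this page's table instead of fetching the page:

```python
import csv
import io
import json

# Hypothetical stand-in for the page's inline script, built from two rows of the table.
_payload = {"data": {"chartData": (
    "Partij\tNaam\tTotaal\tPositie op kieslijst\n"
    "50PLUS\tden Haan, N.L.\t80533\t1\n"
    "50PLUS\tBrood, R.G.\t2581\t2"
)}}
SAMPLE_SCRIPT = "window.sample = JSON.parse(" + json.dumps(json.dumps(_payload)) + ");"


def extract_json_parse_payload(script_text):
    """Return the object passed (as a string literal) to the first JSON.parse( call."""
    marker = "JSON.parse("
    start = script_text.index(marker) + len(marker)
    # raw_decode parses one JSON value starting at `start` and stops there,
    # so the trailing ");" does not need to be trimmed by hand.
    literal, _end = json.JSONDecoder().raw_decode(script_text, start)
    # The literal holds JSON text, so it needs a second decode.
    return json.loads(literal)


def chart_rows(chart_data):
    """Split tab-separated chart data into one dict per row, keyed by the header line."""
    return list(csv.DictReader(io.StringIO(chart_data), delimiter="\t"))


data = extract_json_parse_payload(SAMPLE_SCRIPT)
rows = chart_rows(data["data"]["chartData"])
print(rows[0]["Naam"], rows[0]["Totaal"])  # den Haan, N.L. 80533
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)`. The answer's own script, run against the live page, gives the frame shown next.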
Output:
print(df)
                  Partij                   Naam  Totaal  Positie op kieslijst
0                 50PLUS         den Haan, N.L.   80533                     1
1                 50PLUS            Brood, R.G.    2581                     2
2                 50PLUS    Verkoelen, P.J.H.D.    4890                     3
3                 50PLUS          Nijkamp, M.O.     678                     4
4                 50PLUS  van Tilborg, H.C.A.M.    1446                     5
                     ...                    ...     ...                   ...
1576  Wij zijn Nederland          Schäfer, G.F.      19                     6
1577  Wij zijn Nederland           Mulder, P.J.      15                     8
1578  Wij zijn Nederland           Gilles, A.J.      14                     9
1579  Wij zijn Nederland         Hensen, Y.W.J.      72                    10
1580  Wij zijn Nederland           de Vries, D.      37                    11

[1581 rows x 4 columns]
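On the Selenium loop in the question: the repeated `Rutte, M.` is consistent with both lookups being called on `driver` rather than on `item`, so every iteration searches the whole page and returns the same first match. Calling the lookup on the row element, with a path relative to that row, scopes it to that row. The sketch below illustrates the same scoping rule with the standard library's `xml.etree.ElementTree` instead of Selenium. The two-row markup and the party names in it are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration; the real page's table has more columns and rows.
TABLE = """
<table>
  <tr><td>VVD</td><td>Rutte, M.</td></tr>
  <tr><td>D66</td><td>Kaag, S.</td></tr>
</table>
"""

root = ET.fromstring(TABLE)
rows = root.findall(".//tr")

# Mirrors the question's loop: the search starts from the root on every pass,
# so each iteration returns the first row's second cell.
unscoped = [root.find(".//td[2]").text for _row in rows]

# Scoped version: the search starts from the current row element.
scoped = [row.find("td[2]").text for row in rows]

print(unscoped)  # ['Rutte, M.', 'Rutte, M.']
print(scoped)    # ['Rutte, M.', 'Kaag, S.']
```

In Selenium, the equivalent change would be to call the lookup on `item` inside the loop with an XPath that starts from the row, such as `./td[2]`. Note also that recent Selenium 4 releases removed `find_element_by_xpath` in favor of `find_element(By.XPATH, ...)`.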