I am trying to scrape dynamic content from this URL: https://www.prokabaddi.com/stats/0-102-total-points-statistics. I have tried Selenium and BeautifulSoup, but both give me an empty list. My code is:
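from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.prokabaddi.com/stats/0-102-total-points-statistics"
# create a new Chrome session
driver = webdriver.Chrome()
driver.get(url)
# assumption: the soup was built from the driver's page source (not shown in the original snippet)
soup = BeautifulSoup(driver.page_source, "html.parser")
soup.find_all("div", class_="sipk-lb-playerName")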
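This returns an empty list. When I inspect the page in the browser console, the data is there, but in the page source neither the data nor the div tags are present. I believe this is because the content is rendered by JavaScript.
How can I extract the player names and points from this URL?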
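A browser-free way to see the symptom described above: the sketch below parses a small, hypothetical HTML string in which the player markup exists only inside a script's text, as it would before any JavaScript runs. The HTML, the `leaderboard` id, and the variable names are assumptions made for illustration; only the `sipk-lb-playerName` class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: the container is empty and the player <div>
# appears only as text inside a <script>, i.e. before JavaScript has run.
static_html = """
<html><body>
<div id="leaderboard"></div>
<script>
document.getElementById('leaderboard').innerHTML =
    '<div class="sipk-lb-playerName">Pardeep Narwal</div>';
</script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# The parser keeps script contents as plain text, so no matching element exists.
names = soup.find_all("div", class_="sipk-lb-playerName")
container = soup.find("div", id="leaderboard")

print(len(names))                 # number of player divs found in the static HTML
print(repr(container.get_text())) # text inside the still-empty container
```

An HTML parser never executes scripts, so elements that a page builds at runtime are only visible after a browser has rendered it or when the underlying data feed is requested directly.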
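Answer 0 (score: 2)
Open the developer tools and look at the XHR requests. You will see the URL that the data is fetched from directly. It comes back as JSON, but you can convert it to a table:
Code: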
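import requests
import pandas as pd

url = 'https://www.prokabaddi.com/sifeeds/kabaddi/static/json/1_0_102_stats.json'
# fetch the JSON feed that the page itself loads
jsonData = requests.get(url).json()
# flatten the list of records under 'data' into a DataFrame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
table = pd.json_normalize(jsonData['data'])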
Output:
print (table.head(5).to_string())
   match_played  player_id         player_name  position_id  position_name  rank  team        team_full_name  team_id  team_name  value
0           101        197      Pardeep Narwal          8.0         Raider     1   PAT         Patna Pirates        6        PAT   1055
1           116         81     Rahul Chaudhari          8.0         Raider     2    TT       Tamil Thalaivas       29         TT    987
2           118         41  Deepak Niwas Hooda          1.0    All Rounder     3   JAI  Jaipur Pink Panthers        3        JAI    892
3           115         26         Ajay Thakur          8.0         Raider     4    TT       Tamil Thalaivas       29         TT    811
4            88        326         Rohit Kumar          8.0         Raider     5   BEN       Bengaluru Bulls        1        BEN    689
Then filter to get only the names and points:
print (table[['player_name','value']])
player_name value
0 Pardeep Narwal 1055
1 Rahul Chaudhari 987
2 Deepak Niwas Hooda 892
3 Ajay Thakur 811
4 Rohit Kumar 689
5 Maninder Singh 673
6 Rishank Devadiga 619
7 Kashiling Adake 612
8 Anup Kumar 596
9 Pawan Kumar Sehrawat 572
10 Manjeet Chhillar 562
11 Sandeep Narwal 533
12 Monu Goyat 475
13 Jang Kun Lee 462
14 Sachin Tanwar 456
15 Nitin Tomar 445
16 Jasvir Singh 412
17 Rajesh Narwal 397
18 Sukesh Hegde 395
19 Meraj Sheykh 393
20 Naveen Kumar 364
21 Vikash Kandola 358
22 Prashanth Kumar Rai 358
23 K. Prapanjan 357
24 Shrikant Jadhav 342
25 Siddharth Sirish Desai 337
26 Ran Singh 319
27 Ravinder Pahal 317
28 Deepak Narwal 306
29 Wazir Singh 300
.. ... ...
359 Rohit Kumar Prajapat 1
360 Kazuhiro Takano 1
361 Inderpal Bishnoi 1
362 Amit Kumar 1
363 Sunil Subhash Lande 1
364 Atif Waheed 1
365 Nithesh B R 1
366 Mohammad Taghi Paein Mahali 1
367 Yong Joo Ok 1
368 Vishnu Uthaman 1
369 Ajvender Singh 1
370 Sanju 1
371 Ravinandan G.M. 1
372 Navjot Singh 1
373 Parvesh Attri 1
374 Hardeep Duhan 1
375 Parveen Narwal 1
376 Ajay Singh 1
377 Nitin Kumar 1
378 Jishnu 1
379 Naveen Narwal 1
380 M. Abishek 1
381 Vikas Chhillar 1
382 Aman 1
383 Satywan 1
384 Vikram Kandola 1
385 Emad Sedaghatnia 1
386 Aashish Nagar 1
387 Ajinkya Rohidas Kapre 1
388 Munish 1
[389 rows x 2 columns]
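If a DataFrame is not needed, the same name/points pairs can be read straight from the decoded JSON. The sketch below assumes the feed is a dict whose `data` key holds a list of records with `player_name` and `value` keys, which is what the column names above suggest. The sample payload is hand-written from the first rows of the output rather than fetched, and `extract_points` is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like the feed
# (assumption: records use the same keys as the columns shown above).
sample = json.loads("""
{"data": [
  {"player_name": "Pardeep Narwal", "value": 1055, "rank": 1},
  {"player_name": "Rahul Chaudhari", "value": 987, "rank": 2},
  {"player_name": "Deepak Niwas Hooda", "value": 892, "rank": 3}
]}
""")


def extract_points(payload):
    """Return (player_name, points) tuples from a decoded stats payload."""
    # int() also covers the case where the feed stores the score as a string
    return [(rec["player_name"], int(rec["value"])) for rec in payload["data"]]


pairs = extract_points(sample)
for name, points in pairs:
    print(name, points)
```

With the live feed, `extract_points(requests.get(url).json())` would be the equivalent call, provided the feed keeps that shape.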
Answer 1 (score: 0)
Here is a possible Selenium-based solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup

# Selenium 3 style call; newer Selenium 4 releases expect the driver path via a Service object
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
# keep clicking "load more" until the button no longer appears
while True:
    _b = [i for i in d.find_elements(By.TAG_NAME, 'span') if 'load more' in i.text.lower()]
    if _b:
        _b[0].click()
    else:
        break
# parse the fully expanded page and build one dict per player row
r = [{'name': i.find('div', {'class': 'sipk-lb-playerName'}).get_text(strip=True),
      'points': i.find('div', {'class': 'sipk-lb-detailBlock sipk-lb-raidPoints'}).get_text(strip=True)}
     for i in soup(d.page_source, 'html.parser').find_all('div', {'class': 'sipk-lb-detailItem wl-team-detail'})]
print(len(r))
print(r)
Output:
389
[{'name': 'Pardeep Narwal', 'points': '1055'}, {'name': 'Rahul Chaudhari', 'points': '987'}, {'name': 'Deepak Niwas Hooda', 'points': '892'}, {'name': 'Ajay Thakur', 'points': '811'}, {'name': 'Rohit Kumar', 'points': '689'}, {'name': 'Maninder Singh', 'points': '673'}, {'name': 'Rishank Devadiga', 'points': '619'}, {'name': 'Kashiling Adake', 'points': '612'}, {'name': 'Anup Kumar', 'points': '596'}, {'name': 'Pawan Kumar Sehrawat', 'points': '572'}......]
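The parsing half of this answer can be checked without launching a browser by running it on a small HTML fragment. The fragment below is hypothetical and only mirrors the class names used above; `parse_rows` is an illustrative helper, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mirrors the class names used in the answer.
sample_html = """
<div class="sipk-lb-detailItem wl-team-detail">
  <div class="sipk-lb-playerName">Pardeep Narwal</div>
  <div class="sipk-lb-detailBlock sipk-lb-raidPoints">1055</div>
</div>
<div class="sipk-lb-detailItem wl-team-detail">
  <div class="sipk-lb-playerName">Rahul Chaudhari</div>
  <div class="sipk-lb-detailBlock sipk-lb-raidPoints">987</div>
</div>
"""


def parse_rows(page_source):
    """Collect one {'name', 'points'} dict per player row in the HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    rows = []
    # class_ with a single class name matches elements that carry it among others
    for item in soup.find_all("div", class_="sipk-lb-detailItem"):
        name = item.find("div", class_="sipk-lb-playerName")
        points = item.find("div", class_="sipk-lb-raidPoints")
        if name is None or points is None:
            continue  # skip rows that lack either field
        rows.append({"name": name.get_text(strip=True),
                     "points": int(points.get_text(strip=True))})
    return rows


rows = parse_rows(sample_html)
print(rows)
```

In the Selenium script, `parse_rows(d.page_source)` could take the place of the list comprehension. Converting the points to `int` is a choice made here so that the values sort numerically.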