I'm trying to scrape this site, https://fmdataba.com/19/p/621/toni-kroos/, which contains game stats for football players, using Selenium.
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome('chromedriver.exe')
driver.implicitly_wait(3)
driver.get('https://fmdataba.com/19/p/621/toni-kroos/')

# wait for the page to load
sleep(1)

data = driver.find_element_by_class_name('panel-body')
print(data.text)
Doing this, I can get some useful information by doing data.text.split('\n') and printing it:
for i, text in enumerate(data.text.split('\n')):
    print(i, text)
This gives me
...more...
34 Value € 69.0M
35 Wage € 20,000,000 p/a
36 Status First Team
37 Contrat 30/6/2022
38 Pre. Foot Either
39 Position DM, M (C)
40 Best Alternatives
41 * Players with similar attributes order by value, each attributes (3-) / (3+)
42 TECHNICAL
43 Corners 18
44 Crossing 18
45 Dribbling 14
46 Finishing 13
47 First Touch 18
48 Free Kick 14
49 Heading 7
50 Long Shots 17
51 Long Throws 8
52 Marking 8
53 Passing 20
54 Penalty Taking 13
55 Tackling 9
56 Technique 16
...more...
Then I do the following to parse the information I need:
player_info = data.text.split('\n')

# 20: Age
bdate = player_info[20]
# 28: Nation
nation = player_info[28]
# 37: Foot
foot = player_info[37]

# 51 - 64: Tech
technical = {}
for stat in player_info[51:65]:
    item = stat.split(' ')
    if len(item) == 2:
        ability, rate = item[0], item[1]
    if len(item) == 3:
        ability, rate = '{} {}'.format(item[0], item[1]), item[2]
    technical[ability] = int(rate)
and finally did something similar to
player_obj = {
    'profile_img': img_url,
    'name': name,
    'birth_date': bdate,
    'nation': nation,
    'position': pos,
    'foot': foot,
    'abilities': abilities
}
to build the object I need.

However, this isn't general: if I try the same thing on another player's page, some of those indices hold different information.

How can I make this more general?

The final object I'd like for each player looks like this:
{
    "profile_img": "https://fmdataba.com/images/p/3771.png",
    "name": "Eden Hazard",
    "birth_date": "7/1/1991",
    "nation": "Belgium",
    "position": "AM (RLC)",
    "foot": "Either",
    "abilities": {
        "technical": {
            "Corners": 12,
            "Crossing": 12,
            "Dribbling": 20,
            "Finishing": 14,
            "First Touch": 17,
            "Free Kick": 13,
            "Heading": 7,
            "Long Shots": 11,
            "Long Throws": 5,
            "Marking": 3,
            "Passing": 15,
            "Penalty Taking": 19,
            "Tackling": 4,
            "Technique": 18
        },
        "mental": {
            "Aggression": 8,
            "Anticipation": 12,
            "Bravery": 17,
            "Composure": 15,
            "Concentration": 13,
            "Decisions": 16,
            "Determination": 15,
            "Flair": 18,
            "Leadership": 6,
            "Off The Ball": 14,
            "Positioning": 7,
            "Teamwork": 9,
            "Vision": 16,
            "Work Rate": 12
        },
        "physical": {
            "Acceleration": 17,
            "Agility": 20,
            "Balance": 16,
            "Jumping Reach": 8,
            "Natural Fitness": 16,
            "Pace": 16,
            "Stamina": 17,
            "Strength": 11
        }
    }
}
Thanks!
Answer 0 (score: 0)
EDIT: Just realized you are not using BeautifulSoup (or any other HTML parser). Using string manipulation when scraping web pages will only bring you trouble when you could just parse the HTML and work with that instead. Check out https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and try to incorporate it into your workflow; it will help you immensely.

For the table scraping: find all rows (<tr>), and for each row find all cells (<td>) and put them into a list, like below.

def scrape_table(table: Tag) -> list:
    rows = []
    for row in table.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
    return rows

A few notes:

You don't need a big gun like Selenium when requests with a couple of headers will do. Most websites set up basic barriers that block requests arriving without a User-Agent header; adding one here lets us scrape the page just fine, and not having to launch a browser speeds the whole process up considerably.

If you have a list of [key, value] pairs, you can pack them into a dictionary with the dict function. That works for this page because every table row holds exactly one stat name and one number (see the toy example right after these notes).

I've deliberately duplicated some code here, but you could easily refactor the table search by title into a find_table_by_title function, for example (see the sketch after the full listing).
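For instance, here is what that dict call does on a couple of made-up pairs (toy values, not taken from the page):

rows = [['Corners', '12'], ['Crossing', '12'], ['Dribbling', '20']]
stats = dict(rows)
# stats == {'Corners': '12', 'Crossing': '12', 'Dribbling': '20'}

Note that the ratings stay strings; cast them with int() if you need numbers.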
Here you go:
import requests
from bs4 import BeautifulSoup, Tag


def scrape_table(table: Tag) -> list:
    rows = []
    for row in table.find_all('tr'):
        cells = [cell.text.strip() for cell in row.find_all('td')]
        rows.append(cells)
    return rows


def scrape_technical(soup: BeautifulSoup) -> dict:
    # find the table by its column title
    col_title_el = soup.find('h3', text='TECHNICAL')

    # go up the parents until we find one that
    # contains both the column title and the table, but is separate for each column.
    # .panel seems to fit our criteria
    panel_el = col_title_el.find_parent(class_='panel')

    # now we can find the table
    table_el = panel_el.find('table')

    rows = scrape_table(table_el)
    return dict(rows)


def scrape_mental(soup: BeautifulSoup) -> dict:
    col_title_el = soup.find('h3', text='MENTAL')
    panel_el = col_title_el.find_parent(class_='panel')
    table_el = panel_el.find('table')
    rows = scrape_table(table_el)
    return dict(rows)


def scrape_physical(soup: BeautifulSoup) -> dict:
    col_title_el = soup.find('h3', text='PHYSICAL')
    panel_el = col_title_el.find_parent(class_='panel')
    table_el = panel_el.find('table')
    rows = scrape_table(table_el)
    return dict(rows)


def scrape_profile_page(url) -> dict:
    res = requests.get(
        url=url,
        headers={
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'
        }
    )
    res.raise_for_status()

    soup: BeautifulSoup = BeautifulSoup(res.text, 'html.parser')

    technical = scrape_technical(soup)
    mental = scrape_mental(soup)
    physical = scrape_physical(soup)

    return {
        'technical': technical,
        'mental': mental,
        'physical': physical,
    }


if __name__ == "__main__":
    stats = scrape_profile_page('https://fmdataba.com/19/p/621/toni-kroos/')

    from pprint import pprint
    pprint(stats)
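And here is a minimal sketch of the find_table_by_title refactor mentioned in the notes, assuming every stat row holds exactly one name and one numeric rating (the int() cast relies on that):

def find_table_by_title(soup: BeautifulSoup, title: str) -> Tag:
    # locate the column heading, walk up to its .panel, then grab the table inside
    col_title_el = soup.find('h3', text=title)
    panel_el = col_title_el.find_parent(class_='panel')
    return panel_el.find('table')


def scrape_stats(soup: BeautifulSoup, title: str) -> dict:
    rows = scrape_table(find_table_by_title(soup, title))
    # dict(rows) would keep the ratings as strings; cast them to int
    # to match the target object from the question
    return {name: int(value) for name, value in rows}

With that, scrape_technical(soup) becomes scrape_stats(soup, 'TECHNICAL'), and likewise for 'MENTAL' and 'PHYSICAL'.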