我正在尝试在python中附加一个URL,以从目标URL中抓取详细信息。 我有以下代码,但它似乎是从url1而非URL抓取数据。
我已经从NFL网站上删除了球队名称,没有任何问题。问题出在Spotrac URL上,我要附加从NFL网站上抓取的球队名称。
import requests
from bs4 import BeautifulSoup
URL ='https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
team_name = []
team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
if team.find('p'):
team_name.append(team.text)
for team in team_name:
team = team.replace(" ", "-").lower()
url1 = 'https://www.spotrac.com/nfl/rankings/'
URL = url1 +str(team)
print(URL)
data = {
'ajax': 'true',
'mobile': 'false'
}
bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
spotrac_df = pd.DataFrame(columns = ['Name', 'Salary'])
for h3 in bs_soup.select('h3'):
spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
我几乎可以肯定问题出在URL附加不正确。抓取是从url1而非URL获取薪水等。
答案 0 :(得分:0)
URL正确附加了,但是您的团队名称中有一个空白空格。我还做了一些其他更改,并在代码中记下了它们。
最后,(并且我曾经做过这两个),创建一个空的数据框,然后在我认为不是最佳方法的每次迭代之后追加到它。有人告诉我,最好使用列表/字典来构造行,然后完成后,再调用pandas来构造数据框,因此也进行了更改。
import requests
from bs4 import BeautifulSoup
import pandas as pd
url ='https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
team_name = []
team_name_list = soup.find_all('h4',class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
if team.find('p'):
team_name.append(team.text.strip()) #<- remove leading/trailing white space
url1 = 'https://www.spotrac.com/nfl/rankings/' #<- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name:
team = '-'.join(team.split()).lower() #<- changed to split in case theres 2 spaces between city and team
url1 = 'https://www.spotrac.com/nfl/rankings/'
url = url1 + str(team)
print(url)
data = {
'ajax': 'true',
'mobile': 'false'
}
bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
for h3 in bs_soup.select('h3'):
spotrac_rows.append({'Name': str(h3.text), 'Salary' : str(h3.find_next(class_="rank-value").text.strip())}) #<- remove white space from the salary
spotrac_df = pd.DataFrame(spotrac_rows)
输出:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]