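I want to get several tables from individual players' pages. When the scraper searches for someone like Sergio Rodriguez, multiple names come up (https://basketball.realgm.com/search?q=Sergio+Rodriguez), so instead of landing on an individual player's page it ends up printing "No international table for Sergio Rodriguez". Of the three results, I want to go to the individual page of the Sergio Rodriguez who played in the NBA (the second one) and scrape his tables, but I'm not sure how to do that. How can I use rel, since that seems to be the only workable way? The pseudocode is there if it helps. Thanks.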
HTML:
<tbody>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td class="nowrap" rel="5">SF</td>
<td class="nowrap" rel="79">6-7</td>
<td class="nowrap" rel="202">202</td>
<td class="nowrap" rel="19931018"><a href="/info/birthdays/19931018/1">Oct 18, 1993</a></td>
<td class="nowrap" rel="2015"><a href="/nba/draft/past_drafts/2015" target="_blank">2015</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td class="nowrap" rel="1">PG</td>
<td class="nowrap" rel="75">6-3</td>
<td class="nowrap" rel="176">176</td>
<td class="nowrap" rel="19860612"><a href="/info/birthdays/19860612/1">Jun 12, 1986</a></td>
<td class="nowrap" rel="2006"><a href="/nba/draft/past_drafts/2006" target="_blank">2006</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="NYK, PHL, POR, SAC"><a href="/nba/teams/New-York-Knicks/20/Rosters/Regular/2010">NYK</a>, <a href="/nba/teams/Philadelphia-Sixers/22/Rosters/Regular/2017">PHL</a>, <a href="/nba/teams/Portland-Trail-Blazers/24/Rosters/Regular/2009">POR</a>, <a href="/nba/teams/Sacramento-Kings/25/Rosters/Regular/2010">SAC</a></td>
</tr>
<tr>
<td class="nowrap tablesaw-cell-persist" rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td>
<td class="nowrap" rel="3">SG</td>
<td class="nowrap" rel="76">6-4</td>
<td class="nowrap" rel="-">-</td>
<td class="nowrap" rel="19771012"><a href="/info/birthdays/19771012/1">Oct 12, 1977</a></td>
<td class="nowrap" rel="1999"><a href="/nba/draft/past_drafts/1999" target="_blank">1999</a></td>
<td class="nowrap" rel="N/A">-</td>
<td rel="-">-</td>
</tr>
</tbody>
import requests
from bs4 import BeautifulSoup
import pandas as pd

playernames = ['Carlos Delfino', 'Sergio Rodriguez']
result = pd.DataFrame()
for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # check the response url
    if (response.url == "https://basketball.realgm.com/search..."):
        # parse the search results, finding players who played in NBA
        ... get urls from the table ...
        soup.table... # etc.
        foreach url in table:
            response = requests.get(player_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # call the parse function for a player page
            ...
            parse_player(soup)
    else: # we have a player page
        # call the parse function for a player page, same as above
        ...
        parse_player(soup)

    try:
        table1 = soup.find('h2', text='International Regular Season Stats - Per Game').findNext('table')
        table2 = soup.find('h2', text='International Regular Season Stats - Advanced Stats').findNext('table')
        df1 = pd.read_html(str(table1))[0]
        df2 = pd.read_html(str(table2))[0]
        commonCols = list(set(df1.columns) & set(df2.columns))
        df = df1.merge(df2, how='left', on=commonCols)
        df['Player'] = name
    except:
        print('No international table for %s.' % name)
        df = pd.DataFrame([name], columns=['Player'])
Answer 0 (score: 1)
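pandas has a very useful method for reading HTML directly, which is especially handy when you want to pull information out of tables. Essentially, pandas scrapes every table on the page and reads each one into a DataFrame. Read more in the pandas documentation for read_html.
The problem here is that you also need the link to the player's page, and the read_html method reads the table as text, ignoring the tags.
Still, I found a possible solution. It is by no means the best approach, but hopefully you can use it and improve on it.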
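The approach is:
1. Read the table with the read_html method.
2. Get the name of the player who has played in the NBA (the player with NBA != '-').
3. Several players may share the name Sergio Rodriguez, but only one of them has played in the NBA. You need that player's index among the rows with the same name (counting from 0) to find the link later. In the full table he is row 1; among the rows named exactly Sergio Rodriguez he is index 0, which is the value the code below computes.
4. Read all the links on the page and, among the links whose text is Sergio Rodriguez, take the one at that index.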
import pandas as pd
import requests
from bs4 import BeautifulSoup
# read the data from the website as a list of dataframes (tables)
web_data = pd.read_html('https://basketball.realgm.com/search?q=Sergio+Rodriguez')
# the table you need is the second to last one
required_table = web_data[len(web_data)-2]
print (required_table)
>>>
                    Player Pos   HT   WT    Birth Date  Draft Year College                 NBA
0  Sergio Rodriguez Febles  SF  6-7  202  Oct 18, 1993        2015       -                   -
1         Sergio Rodriguez  PG  6-3  176  Jun 12, 1986        2006       -  NYK, PHL, POR, SAC
2         Sergio Rodriguez  SG  6-4    -  Oct 12, 1977        1999       -                   -
### get the player name who has played in NBA
required_player_name = required_table.loc[required_table['NBA']!='-']['Player'].values[0]
print (required_player_name)
>>>
Sergio Rodriguez
## check for duplicate players with this name (reset the index so the rows with the same name are numbered in order)
table_with_player = required_table.loc[(required_table['Player']==required_player_name)].reset_index(drop=True)
# get the index, among those rows, of the player whose NBA column is not '-'
index_of_player_to_get = list(table_with_player[table_with_player['NBA']!='-'].index)[0]
### e.g. if index_of_player_to_get == 2, we need the 3rd link whose text == required_player_name
print (index_of_player_to_get)
>>>
0
Now we can read all the links on the page and, among the links whose text is Sergio Rodriguez, pull out the one at position index_of_player_to_get.
url='https://basketball.realgm.com/search?q=Sergio+Rodriguez'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
## get all links
all_links = soup.find_all('a', href=True)
link_idx = -1
for link in all_links:
    if link.text == required_player_name:
        # player name found, increment link_idx
        link_idx += 1
        if link_idx == index_of_player_to_get:
            print (link['href'])
            break  # stop once the wanted link has been printed
>>>
/player/Sergio-Rodriguez/Summary/85
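As a possible shortcut for the index bookkeeping above: newer pandas versions (1.5 and later, as far as I know) accept an extract_links argument in read_html that keeps each cell's href next to its text. The sketch below is only an illustration under assumptions: it runs on a hand-trimmed two-column copy of the search table rather than the live page, it needs an HTML parser backend such as lxml installed, and the variable names are made up for the example.

```python
import io

import pandas as pd

# Trimmed, hypothetical copy of the search-result table (Player and NBA columns only).
html = """<table>
<thead><tr><th>Player</th><th>NBA</th></tr></thead>
<tbody>
<tr><td><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td><td>-</td></tr>
<tr><td><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td><td>NYK, PHL, POR, SAC</td></tr>
<tr><td><a href="/player/Sergio-Rodriguez/Summary/39601">Sergio Rodriguez</a></td><td>-</td></tr>
</tbody></table>"""

# With extract_links="body", every body cell becomes a (text, href) tuple;
# href is None when the cell contains no link.
table = pd.read_html(io.StringIO(html), extract_links="body")[0]

# Keep the rows whose NBA text is not the '-' placeholder.
nba_rows = table[table["NBA"].map(lambda cell: cell[0] != "-")]

# The link sits in the same row, so no matching by name or position is needed.
links = [href for _text, href in nba_rows["Player"]]
print(links)
```

Because the href travels with its row, two players with the same name can no longer be confused. The hrefs are relative, so they still need the site's base URL in front before they can be requested.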
Answer 1 (score: 0)
So, you know that the rel you care about is always in the eighth column of the table, so you can do something like this:
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
rows = soup.find_all('tr')  # Get each row from the table
eighth_text = [row.find_all('td')[7].text for row in rows]  # Get the text of the eighth column
idx = [n for n, text in enumerate(eighth_text) if text != '-']  # Indices of all rows that have text (are NBA players)
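You can then access that player (or those players) with:
for i in idx:
    print(rows[i].a)
Or whichever attribute you are looking for. There may be more Pythonic ways, but I prioritized being easy to understand.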
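Since the question asks about rel specifically, here is a minimal sketch of the same idea that reads the rel attribute of the eighth cell instead of its text, skips rows that do not have eight td cells (a header row made of th cells would otherwise raise an IndexError), and turns the relative href into a full URL. The function name nba_player_urls, the BASE_URL constant and the trimmed sample markup are assumptions made for the illustration, not anything the site or the libraries define.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://basketball.realgm.com"  # assumed base for the relative hrefs


def nba_player_urls(html, base_url=BASE_URL):
    """Return (name, absolute URL) for rows whose eighth cell is not marked '-'."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 8:
            continue  # e.g. a header row built from <th> cells
        # Prefer the rel attribute; fall back to the cell text if rel is missing.
        marker = cells[7].get("rel", cells[7].get_text(strip=True))
        link = cells[0].find("a", href=True)
        if marker == "-" or link is None:
            continue
        results.append((link.get_text(strip=True), urljoin(base_url, link["href"])))
    return results


# Trimmed copy of two rows from the HTML in the question.
sample = """<table><tbody>
<tr>
<td rel="Rodriguez Febles, Sergio"><a href="/player/Sergio-Rodriguez-Febles/Summary/50443">Sergio Rodriguez Febles</a></td>
<td rel="5">SF</td><td rel="79">6-7</td><td rel="202">202</td>
<td rel="19931018">Oct 18, 1993</td><td rel="2015">2015</td>
<td rel="N/A">-</td><td rel="-">-</td>
</tr>
<tr>
<td rel="Rodriguez, Sergio"><a href="/player/Sergio-Rodriguez/Summary/85">Sergio Rodriguez</a></td>
<td rel="1">PG</td><td rel="75">6-3</td><td rel="176">176</td>
<td rel="19860612">Jun 12, 1986</td><td rel="2006">2006</td>
<td rel="N/A">-</td><td rel="NYK, PHL, POR, SAC">NYK, PHL, POR, SAC</td>
</tr>
</tbody></table>"""

print(nba_player_urls(sample))
```

Each returned URL can then be passed to requests.get and the resulting page parsed the same way as in the question.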