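I'm trying to extract information from this page, which contains HTML like the snippet below.
I want the text of the first class="currentServers" span in each row (for example, from the line <span class="currentServers">745,807</span> I want 745,807).
The problem is that each row contains two spans with class="currentServers". I want the value from the first column of the row.
HTML: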
<tr class="player_count_row" style="">
<td align="right">
<span class="currentServers">745,807</span>
</td>
<td align="right">
<span class="currentServers">836,540</span>
</td>
<td width="20"> </td>
<td>
<a class="gameLink" onmouseover="GameHover( this, event, 'global_hover', {"type":"app","id":570,"v6":1} );" onmouseout="HideGameHover( this, event, 'global_hover' )" href="http://store.steampowered.com/app/570/">Dota 2</a>
</td>
</tr>
I feel like I'm close, but I can't figure it out.
This is what I tried:
def GetTopGamesByPlayers():
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    r = []
    final_link = soup.p.a
    final_link.decompose()
    links = soup.findAll("a", { "class" : "gameLink" })
    currentPlayers = soup.findAll("span", {"class" : "currentServers"})
    players = ""
    i = 0
    for player in currentPlayers :
        for link in links:
            players = currentPlayers[0].text
            try:
                appid = link.get('onmouseover')
                appid = findAppIdFromStats(appid,'"id":' , ',"public":1')
                linkg = link.get('href')
            except AttributeError:
                r.append(["N/A","N/A","N/A"])
            r.append([appid,linkg,players])
    c = ["N/A","N/A", "N/A"]
    while c in r:
        r.remove(c)
    return r

def findAppIdFromStats( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return "first: " + first + "last: " + last
Here is the output:
[u'346110', u'http://store.steampowered.com/app/346110/', u'745,807']
[u'230410', u'http://store.steampowered.com/app/230410/', u'745,807']
[u'252950', u'http://store.steampowered.com/app/252950/', u'745,807']
[u'482730', u'http://store.steampowered.com/app/482730/', u'745,807']
[u'252490', u'http://store.steampowered.com/app/252490/', u'745,807']
[u'4000', u'http://store.steampowered.com/app/4000/', u'745,807']
[u'444090', u'http://store.steampowered.com/app/444090/', u'745,807']
[u'359550', u'http://store.steampowered.com/app/359550/', u'745,807']
[u'588430', u'http://store.steampowered.com/app/588430/', u'745,807']
[u'374320', u'http://store.steampowered.com/app/374320/', u'745,807']
[u'8930', u'http://store.steampowered.com/app/8930/', u'745,807']
[u'107410', u'http://store.steampowered.com/app/107410/', u'745,807']
[u'238960', u'http://store.steampowered.com/app/238960/', u'745,807']
[u'304930', u'http://store.steampowered.com/app/304930/', u'745,807']
[u'10', u'http://store.steampowered.com/app/10/', u'745,807']
[u'72850', u'http://store.steampowered.com/app/72850/', u'745,807']
[u'289070', u'http://store.steampowered.com/app/289070/', u'745,807']
[u'105600', u'http://store.steampowered.com/app/105600/', u'745,807']
[u'377160', u'http://store.steampowered.com/app/377160/', u'745,807']
[u'236390', u'http://store.steampowered.com/app/236390/', u'745,807']
[u'292030', u'http://store.steampowered.com/app/292030/', u'745,807']
[u'227300', u'http://store.steampowered.com/app/227300/', u'745,807']
[u'386360', u'http://store.steampowered.com/app/386360/', u'745,807']
[u'236850', u'http://store.steampowered.com/app/236850/', u'745,807']
[u'364360', u'http://store.steampowered.com/app/364360/', u'745,807']
[u'381210', u'http://store.steampowered.com/app/381210/', u'745,807']
[u'363970', u'http://store.steampowered.com/app/363970/', u'745,807']
[u'453480', u'http://store.steampowered.com/app/453480/', u'745,807']
... ... ...
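I want to extract the values circled in red:
(appid, current players, game name) - I can successfully get the appid and game name for each game, but I can't get the matching current player count for each one.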
Answer 0: (score: 0)
I would grab each row, then grab the first instance of .currentServers within it, like this:
rows = soup.find_all(class_='player_count_row')
for row in rows:
    # find() returns only the first matching element inside this row
    print(row.find(class_='currentServers').text)
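A minimal, self-contained sketch of the row-by-row approach above. It assumes BeautifulSoup 4 (`bs4`) with the built-in `html.parser`. The `SAMPLE_HTML` string and the `first_counts` helper are illustrative names, not part of the original code, and the second table row is a made-up placeholder.

```python
from bs4 import BeautifulSoup

# The first row is trimmed from the question. The second row is a made-up
# placeholder so the loop has more than one row to visit.
SAMPLE_HTML = """
<table>
<tr class="player_count_row">
  <td align="right"><span class="currentServers">745,807</span></td>
  <td align="right"><span class="currentServers">836,540</span></td>
  <td><a class="gameLink" href="http://store.steampowered.com/app/570/">Dota 2</a></td>
</tr>
<tr class="player_count_row">
  <td align="right"><span class="currentServers">1,111</span></td>
  <td align="right"><span class="currentServers">2,222</span></td>
  <td><a class="gameLink" href="http://example.invalid/app/0/">Placeholder Game</a></td>
</tr>
</table>
"""


def first_counts(html):
    """Return the text of the first .currentServers span in each row."""
    soup = BeautifulSoup(html, "html.parser")
    counts = []
    for row in soup.find_all("tr", class_="player_count_row"):
        # find() stops at the first match inside this row,
        # so the second span in the row is never used.
        span = row.find("span", class_="currentServers")
        if span is not None:
            counts.append(span.get_text(strip=True))
    return counts


print(first_counts(SAMPLE_HTML))
```

Searching from `row` rather than from `soup` is what keeps each count tied to its own row.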
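Answer 1: (score: 0)
Is there a reason you're using two loops?
If not, you could try a single loop: while iterating over the links, find the preceding tr, then find its first td, which contains the player count you want.
Example: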
for link in links:
    players = currentPlayers[0].text
    try:
        appid = link.get('onmouseover')
        appid = findAppIdFromStats(appid,'"id":' , ',"public":1')
        linkg = link.get('href')
    except AttributeError:
        r.append(["N/A","N/A","N/A"])
    r.append([appid, linkg, link.find_previous("tr", class_="player_count_row").find("td").get_text(strip=True)])
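The link-to-row lookup can also be tried in isolation. The sketch below is a close variant, not the exact code above: because the link sits inside its row, it uses `find_parent`, which only considers ancestors, in place of `find_previous`. It assumes BeautifulSoup 4 with the built-in `html.parser`; `SAMPLE_HTML` and `players_for_link` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row shown in the question.
SAMPLE_HTML = """
<table>
<tr class="player_count_row">
  <td align="right"><span class="currentServers">745,807</span></td>
  <td align="right"><span class="currentServers">836,540</span></td>
  <td><a class="gameLink" href="http://store.steampowered.com/app/570/">Dota 2</a></td>
</tr>
</table>
"""


def players_for_link(link):
    """Return the first-column text of the row that contains `link`."""
    # Walk up from the link to its enclosing row.
    row = link.find_parent("tr", class_="player_count_row")
    if row is None:
        return "N/A"
    # The first <td> of the row holds the current player count.
    first_cell = row.find("td")
    if first_cell is None:
        return "N/A"
    return first_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a", class_="gameLink")
print(link.get_text(), players_for_link(link))
```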
Answer 2: (score: 0)
I managed to fix it by modifying the code:
links = soup.findAll("a", { "class" : "gameLink" })
currentPlayers = soup.findAll("span", {"class" : "currentServers"})
players = ""
rows = soup.findAll("tr", { "class" : "player_count_row" })
for row in rows:
    players = row.findAll("span", { "class" : "currentServers" })[0].text
    for link in links:
        try:
            appid = link.get('onmouseover')
            appid = findAppIdFromStats(appid,'"id":' , ',"public":1')
            linkg = link.get('href')
        except AttributeError:
            r.append(["N/A","N/A","N/A"])
        r.append([appid,linkg,players])
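One thing to watch with this version: if `links` holds every game link on the page, looping over it for each row pairs each row's count with every link. A possible variant, sketched under assumptions, looks up the link inside the same row instead. It assumes BeautifulSoup 4 with the built-in `html.parser`, and it takes the app id from the `href` with a regular expression, which is a swapped-in technique and not the `onmouseover` parsing used above. `SAMPLE_HTML`, `APP_ID_RE`, and `top_games` are hypothetical names.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the row shown in the question.
SAMPLE_HTML = """
<table>
<tr class="player_count_row">
  <td align="right"><span class="currentServers">745,807</span></td>
  <td align="right"><span class="currentServers">836,540</span></td>
  <td><a class="gameLink" href="http://store.steampowered.com/app/570/">Dota 2</a></td>
</tr>
</table>
"""

# Assumption: store links look like .../app/<digits>/ as in the question.
APP_ID_RE = re.compile(r"/app/(\d+)")


def top_games(html):
    """Return [appid, href, players] for each row, all read from that row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="player_count_row"):
        span = row.find("span", class_="currentServers")  # first column
        link = row.find("a", class_="gameLink")           # link in same row
        if span is None or link is None:
            continue  # skip rows that lack either piece
        href = link.get("href", "")
        match = APP_ID_RE.search(href)
        appid = match.group(1) if match else "N/A"
        results.append([appid, href, span.get_text(strip=True)])
    return results


print(top_games(SAMPLE_HTML))
```

Reading both the count and the link from `row` means no second list has to be kept in step with the rows.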