I'm pretty new to web scraping, so I wrote a small script to extract player scores from this site: http://fold.it/portal/players
Here's the code:
    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://www.fold.it/portal/players").read())
    for row in soup('tr', {'class': 'even'}):
        rank = row('td')[0].string
        td2 = row('td')[1]
        # .text includes the text of the nested span elements as well
        for name in td2('a'):
            user = name.text
        score = row('td')[2].string
        print rank, user, score
This works pretty well, except that the user name also picks up the two other scores. Looking at the HTML, it seems there are two span elements after the a href.
My first thought was to split 'user' on whitespace, but some names have spaces in them, so that didn't work. I also thought about looking for numeric characters, but some users have numeric names as well.
I figure eliminating the span is my best option. However, I'm not sure what the best way to parse them out would be. Any help would be appreciated!
1 Answer:
Answer 0 (score: 3)
The scores are in separate span tags, so use that:
    for row in soup('tr', {'class': 'even'}):
        cells = row('td')
        rank = cells[0].string
        # finding the first text node - this is our name
        name = cells[1].a.find(text=True).strip()
        # ranks are in two separate `span` tags
        rank1, rank2 = cells[1].find_all("span")
        print name, rank1.text, rank2.text
Prints:
Galaxie 1 3
smilingone 2 35
LociOiling 3 9
Desnouck Maarten 4 153
...
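
For anyone who wants to try the idea without hitting the site, here is a minimal Python 3 sketch of the same approach. The HTML in `SAMPLE_HTML` is made up to match the structure described in the question (a link whose text is followed by two nested span tags); the href values, the third-column numbers, and the `parse_rows` helper are assumptions for illustration, not the real fold.it markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's description;
# the real page may differ.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td>1</td>
    <td><a href="/user/1">Desnouck Maarten <span>4</span> <span>153</span></a></td>
    <td>12345</td>
  </tr>
  <tr class="even">
    <td>2</td>
    <td><a href="/user/2">42 <span>7</span> <span>8</span></a></td>
    <td>12000</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (rank, name, [span texts], score) for each 'even' row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("tr", class_="even"):
        cells = row.find_all("td")
        link = cells[1].a
        # The first text node inside the link is the name; the span
        # contents are separate nodes, so they are not included.
        name = link.find(string=True).strip()
        spans = [span.get_text(strip=True) for span in link.find_all("span")]
        rank = cells[0].get_text(strip=True)
        score = cells[2].get_text(strip=True)
        rows.append((rank, name, spans, score))
    return rows


for rank, name, spans, score in parse_rows(SAMPLE_HTML):
    print(rank, name, spans, score)
```

Because the name is taken from its own text node rather than split out of the combined string, names containing spaces or consisting only of digits come through unchanged.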