Eliminating Span Elements in a nested TD using BeautifulSoup

Time: 2015-10-30 23:26:38

Tags: html css parsing web-scraping beautifulsoup

I'm pretty new to web scraping, so I wrote a small script to extract player scores from this site: http://fold.it/portal/players

Here's the code:

    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://www.fold.it/portal/players").read())

    for row in soup('tr', {'class':'even'}):
        rank = row('td')[0].string
        td2 = row('td')[1]
        for name in td2('a'):
            user = name.text
        score = row('td')[2].string
        print rank, user, score

This works pretty well, except that the user name also contains two other scores. Looking at the HTML, it seems there are two span elements after the a href, and their text ends up in name.text.

My first thought was to split 'user' on whitespace, but some names have spaces in them, so that didn't work. I also thought about looking for numeric values, but some users have numeric names as well.

I figure eliminating the spans is my best option. However, I'm not sure what the best way to parse them out would be. Any help would be appreciated!

1 answer:

Answer 0 (score: 3)

The scores are in separate span tags, so use them:

    for row in soup('tr', {'class': 'even'}):
        cells = row('td')

        rank = cells[0].string

        # finding the first text node - this is our name
        name = cells[1].a.find(text=True).strip()

        # ranks are in two separate `span` tags
        rank1, rank2 = cells[1].find_all("span")

        print name, rank1.text, rank2.text

Prints:

    Galaxie 1 3
    smilingone 2 35
    LociOiling 3 9
    Desnouck Maarten 4 153
    ...
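For anyone who wants to try the idea without hitting the live site, here is a minimal, self-contained sketch of the same approach. It is written for Python 3 (the snippets above are Python 2) and runs against an inline HTML sample. The sample markup, the player names, the score values, and the `parse_players` helper are all assumptions made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only imitates the layout described in the question
# (a link holding the name, followed by two <span> elements inside it).
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td>1</td>
    <td><a href="#">Alice <span>1</span> <span>3</span></a></td>
    <td>1000</td>
  </tr>
  <tr class="even">
    <td>2</td>
    <td><a href="#">Bob Smith <span>2</span> <span>35</span></a></td>
    <td>900</td>
  </tr>
  <tr class="even">
    <td>3</td>
    <td><a href="#">12345 <span>3</span> <span>9</span></a></td>
    <td>800</td>
  </tr>
</table>
"""


def parse_players(html):
    """Return (rank, name, span_texts, score) tuples for each 'even' row."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for row in soup.find_all("tr", class_="even"):
        cells = row.find_all("td")
        link = cells[1].a
        # The first text node inside the link is the name; the span
        # contents come later in the tree, so they are not included.
        name = link.find(string=True).strip()
        # Read the two span values separately instead of splitting text.
        span_texts = [span.get_text(strip=True) for span in link.find_all("span")]
        rank = cells[0].get_text(strip=True)
        score = cells[2].get_text(strip=True)
        players.append((rank, name, span_texts, score))
    return players


if __name__ == "__main__":
    for rank, name, span_texts, score in parse_players(SAMPLE_HTML):
        print(rank, name, span_texts, score)
```

Because the name is taken from the first text node rather than by splitting on whitespace or filtering digits, names containing spaces ("Bob Smith") or consisting only of digits ("12345") come through intact.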