Unable to extract only the visible text from a table using BS4

Time: 2016-07-28 19:47:47

Tags: python html web-scraping beautifulsoup wikipedia

I am trying to scrape data such as positions and player names from this web page. My code is below.

import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia page we are going to scrape
wikiURL = "https://en.wikipedia.org/wiki/2012_NFL_Draft"

#create array to store player info in
teams_players = []

# request and parse wikiURL
r = requests.get(wikiURL)
soup = BeautifulSoup(r.content, "html.parser")

#find table in wikipedia
playerData = soup.find('table', {"class": "wikitable sortable"})

for row in playerData.find_all('tr')[1:]:
    cols = row.find_all(['td', 'th'])
    if len(cols) < 6:
        continue
    teams_players.append((cols[5].text.strip(), cols[4].text.strip() ))

for team, player in teams_players:
    print('{:35} {}'.format(team, player))

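The problem is that the name field contains a "sortkey" span tag with its own text in addition to the displayed text, so the output ends up doubled and also shows the symbol.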

QB                                  Luck, AndrewAndrew Luck †
QB                                  Griffin III, RobertRobert Griffin III †

I tried searching for {"class": "fn"}, but that only returns an empty list.

How can I pull out only the displayed text and omit the symbol?

1 Answer:

Answer 0: (score: 2)

If you just want the name and the position, you can simplify the code: select each span with the class fn inside the table's td cells and take its text, then find the next td and take the text of its anchor.

from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2012_NFL_Draft").content,"lxml")
table = soup.select_one("table.wikitable.sortable")

for name_tag in table.select("tr + tr td span.fn"):
    print(name_tag.text, name_tag.find_next("td").a.text)
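
To try the same selector without hitting the network, here is a minimal sketch that runs it against a small inline fragment. The fragment `SAMPLE_HTML` and the helper `extract_players` are hypothetical names introduced for illustration, and the markup only imitates what the draft table is assumed to look like (a hidden `sortkey` span, a `span.fn` holding the visible name, then a linked position cell); it is not copied from the live page. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the assumed markup of the draft table.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Rnd.</th><th>Pick</th><th>Team</th><th>Player</th><th>Pos.</th></tr>
  <tr>
    <td>1</td><td>1</td><td><a href="#">Indianapolis Colts</a></td>
    <td><span class="sortkey" style="display:none">Luck, Andrew</span><span class="vcard"><span class="fn"><a href="#">Andrew Luck</a></span></span> †</td>
    <td><a href="#">QB</a></td>
  </tr>
  <tr>
    <td>1</td><td>2</td><td><a href="#">Washington Redskins</a></td>
    <td><span class="sortkey" style="display:none">Griffin III, Robert</span><span class="vcard"><span class="fn"><a href="#">Robert Griffin III</a></span></span> †</td>
    <td><a href="#">QB</a></td>
  </tr>
</table>
"""


def extract_players(html):
    """Return (name, position) pairs using the span.fn selector."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable.sortable")
    results = []
    # "tr + tr" skips the header row; span.fn holds only the visible name.
    for name_tag in table.select("tr + tr td span.fn"):
        # The first td after the name span is assumed to be the position cell.
        position_cell = name_tag.find_next("td")
        results.append((name_tag.get_text(strip=True),
                        position_cell.get_text(strip=True)))
    return results


for name, position in extract_players(SAMPLE_HTML):
    print(name, position)
```

Unlike the snippet above, this sketch reads the position with `get_text` on the whole cell rather than `.a.text`, so a position cell without a link would not raise an `AttributeError`.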

Running the code prints all the data we want, without any symbols:

Andrew Luck QB
Robert Griffin III QB
Trent Richardson RB
Matt Kalil OT
Justin Blackmon WR
Morris Claiborne CB
Mark Barron S
Ryan Tannehill QB
Luke Kuechly LB
Stephon Gilmore CB
Dontari Poe NT
Fletcher Cox DT
Michael Floyd WR
Michael Brockers DT
Bruce Irvin DE
Quinton Coples DE
Dre Kirkpatrick CB
Melvin Ingram LB
Shea McClellin DE
Kendall Wright WR
Chandler Jones DE
Brandon Weeden QB
Riley Reiff OT
David DeCastro G
Dont'a Hightower LB
Whitney Mercilus DE
Kevin Zeitler G
Nick Perry LB
Harrison Smith S
A. J. Jenkins WR
Doug Martin RB
David Wilson RB
Brian Quick WR
Coby Fleener TE
Courtney Upshaw LB
Derek Wolfe DT
Mitchell Schwartz OT
Andre Branch DE
Janoris Jenkins CB
Amini Silatolu G
Cordy Glenn OT
Jonathan Martin OT
Stephen Hill WR
Jeff Allen G
Alshon Jeffery WR
Mychal Kendricks LB
Bobby Wagner LB
Tavon Wilson S
Kendall Reyes DT
Isaiah Pead RB
Jerel Worthy DT
Zach Brown LB
Devon Still DT
Ryan Broyles WR
Peter Konz C
Mike Adams OT
Brock Osweiler QB
Lavonte David LB
Vinny Curry DE
Kelechi Osemele G
LaMichael James RB
Casey Hayward CB
Rueben Randle WR
Dwayne Allen TE
Trumaine Johnson CB
Josh Robinson CB
Ronnie Hillman RB
DeVier Posey WR
T. J. Graham WR
Bryan Anger P
Josh LeRibeus G
Olivier Vernon DE
Brandon Taylor S
Donald Stephenson OT
Russell Wilson QB
Brandon Brooks G
Demario Davis LB
Michael Egnew TE
Brandon Hardin S
Jamell Fleming CB
Tyrone Crawford DE
Mike Martin DT
Mohamed Sanu WR
Bernard Pierce RB
Dwight Bentley CB
Sean Spence LB
John Hughes DT
Nick Foles QB
Akiem Hicks DT
Jake Bequette DE
Lamar Holmes OT
T. Y. Hilton WR
Brandon Thompson DT
Jayron Hosley CB
Tony Bergstrom G
Chris Givens WR
Lamar Miller RB
Gino Gradkowski G
Ben Jones C
Travis Benjamin WR
Omar Bolden CB
Kirk Cousins QB
Frank Alexander DE
Joe Adams WR
Nigel Bradham LB
Robert Turbin RB
Devon Wylie WR
Philip Blake C
Alameda Ta'amu DT
Ladarius Green TE
Evan Rodriguez TE
Bobby Massie OT
Kyle Wilber LB
Jaye Howard DT
Coty Sensabaugh CB
Orson Charles TE
Joe Looney G
Jarius Wright WR
Keenan Robinson LB
James-Michael Johnson LB
Keshawn Martin WR
Nick Toon WR
Brandon Boykin CB
Ron Brooks CB
Ronnell Lewis LB
Jared Crick DE
Adrien Robinson TE
Rhett Ellison FB
Miles Burris LB
Christian Thompson S
Brandon Mosley OT
Mike Daniels DT
Jerron McMillian S
Greg Childs WR
Matt Johnson S
Josh Chapman DT
Malik Jackson DE
Tahir Whitehead LB
Robert Blanton S
Najee Goode LB
Adam Gettis G
Brandon Marshall LB
Josh Norman CB
Zebrie Sanders OT
Taylor Thompson DE
DeQuan Menzie CB
Tank Carder LB
Chris Greenwood CB
Johnnie Troutman G
Rokevious Watkins G
Senio Kelemete G
Danny Coale WR
Dennis Kelly OT
Korey Toomer LB
Josh Kaddu LB
Shaun Prater CB
Bradie Ewing FB
Jack Crawford DE
Chris Rainey RB
Ryan Miller G
Randy Bullock K
Corey White S
Terrell Manning LB
Jonathan Massaquoi DE
Darius Fleming LB
Marvin Jones WR
George Iloka S
Juron Criner WR
Asa Jackson CB
Vick Ballard RB
Greg Zuerlein K
Jeremy Lane CB
Alfred Morris RB
Keith Tandy CB
Blair Walsh K
Mike Harris CB
Justin Bethel S
Mark Asper G
Andrew Tiller G
Trenton Robinson S
Winston Guy S
Cyrus Gray RB
B.J. Cunningham WR
Isaiah Frey CB
Ryan Lindley QB
James Hanna TE
Josh Bush S
Danny Trevathan LB
Christo Bilukidi DT
Markelle Martin S
Dan Herron RB
Charles Mitchell S
Tom Compton OT
Marvin McNutt WR
Nick Mondek OT
Jonte Green CB
Nate Ebner CB
Tommy Streeter WR
Jason Slowey OT
Brandon Washington G
Matt McCants OT
Terrance Ganaway RB
Robert Griffin G
Emmanuel Acho LB
Billy Winn DT
LaVon Brazill WR
Brad Nortman P
Justin Anderson G
Audie Cole LB
Scott Solomon DE
Michael Smith RB
Richard Crawford CB
Kheeston Randall DT
D. J. Campbell S
Jordan Bernstine CB
Jerome Long DT
Trevor Guyton DE
Greg McCoy CB
Nate Potter OT
Caleb McSurdy ILB
Travis Lewis OLB
Alfonzo Dennard CB
J. R. Sweezy G
David Molk C
Rishard Matthews WR
Jeris Pendleton DT
Bryce Brown RB
Nathan Stupar OLB
Toney Clemons WR
Greg Scruggs DE
Drake Dunsmore TE
Marcel Jones OT
Jeremy Ebert WR
DeAngelo Tyson DT
Cam Johnson DE
Junior Hemingway WR
Markus Kuhn DT
David Paulson TE
Andrew Datko OT
Antonio Allen S
B. J. Coleman QB
Jordan White WR
Trevin Wade CB
Terrence Frederick CB
Brad Smelley TE
Kelvin Beachum G
Travian Robertson DT
Edwin Baker RB
John Potter K
Daryl Richardson RB
Chandler Harnish QB
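
If you would rather keep the column-index loop from the question, a different technique is to delete the hidden `sortkey` spans before reading the cell text and then strip the trailing symbol. The sketch below is only an illustration under assumptions: `SAMPLE_HTML`, `visible_rows`, and the column positions (name at index 4, position at index 5, as in the question's code) are hypothetical, and the fragment merely imitates the assumed table markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the column layout is assumed from the question's indices.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th></th><th>Rnd.</th><th>Pick</th><th>Team</th><th>Player</th><th>Pos.</th></tr>
  <tr>
    <th></th><td>1</td><td>1</td><td>Indianapolis Colts</td>
    <td><span class="sortkey" style="display:none">Luck, Andrew</span><span class="vcard"><span class="fn"><a href="#">Andrew Luck</a></span></span> †</td>
    <td><a href="#">QB</a></td>
  </tr>
  <tr>
    <th></th><td>1</td><td>2</td><td>Washington Redskins</td>
    <td><span class="sortkey" style="display:none">Griffin III, Robert</span><span class="vcard"><span class="fn"><a href="#">Robert Griffin III</a></span></span> †</td>
    <td><a href="#">QB</a></td>
  </tr>
</table>
"""


def visible_rows(html, name_col=4, pos_col=5):
    """Return (position, name) pairs after removing hidden sort keys."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable.sortable")
    # Remove the hidden sort-key spans so they no longer contribute text.
    for hidden in table.select("span.sortkey"):
        hidden.decompose()
    rows = []
    for row in table.find_all("tr")[1:]:
        cols = row.find_all(["td", "th"])
        if len(cols) <= max(name_col, pos_col):
            continue  # skip rows that are too short to hold both columns
        # Drop the trailing dagger symbol that follows the visible name.
        name = cols[name_col].get_text(strip=True).rstrip("†").strip()
        rows.append((cols[pos_col].get_text(strip=True), name))
    return rows


for position, name in visible_rows(SAMPLE_HTML):
    print('{:35} {}'.format(position, name))
```

This keeps the question's output format, at the cost of depending on fixed column positions, which the `span.fn` selector avoids.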