I am trying to extract data from an HTML page in R using the following code:
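library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%

wiki_url_html <- read_html("https://en.wikipedia.org/wiki/List_of_Major_League_Baseball_players_suspended_for_performance-enhancing_drugs")
# Select the third table on the page and convert it to a data frame
bb_player_PED <- (wiki_url_html %>%
  html_nodes(xpath = '//table[3]') %>%
  html_table())[[1]]
head(bb_player_PED, 10)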
This gives me the following data:
Player Team Date announced Drug Penalty Position
1 Sánchez, AlexAlex Sánchez Tampa Bay Devil Rays 000000002005-04-03-0000April 3, 2005 010 !10 days OF
2 Montero, AgustínAgustín Montero Texas Rangers 000000002005-04-20-0000April 20, 2005 010 !10 days P
3 Strong, JamalJamal Strong Seattle Mariners 000000002005-04-26-0000April 26, 2005 010 !10 days OF
4 Rincón, JuanJuan Rincón Minnesota Twins 000000002005-05-02-0000May 2, 2005 010 !10 days P
5 Betancourt, RafaelRafael Betancourt Cleveland Indians 000000002005-07-08-0000July 8, 2005 010 !10 days P
6 Palmeiro, RafaelRafael Palmeiro SS GG Baltimore Orioles 000000002005-08-01-0000August 1, 2005 Stanozolol 010 !10 days DH
7 Franklin, RyanRyan Franklin Seattle Mariners 000000002005-08-02-0000August 2, 2005 010 !10 days P
8 Morse, MikeMike Morse Seattle Mariners 000000002005-09-07-0000September 7, 2005 010 !10 days SS
9 Almanzar, CarlosCarlos Almanzar Texas Rangers 000000002005-10-04-0000October 4, 2005 010 !10 days P
10 Heredia, FélixFélix Heredia New York Mets 000000002005-10-18-0000October 18, 2005 010 !10 days P
Response Ref.
1 [a] [5]
2 [b] [7]
3 [c] [9]
4 [d] [9]
5 [e] [12]
6 [f] [14]
7 [g] [16]
8 [h] [18]
9 [i] [20]
10 [j] [22]
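My question is: how do I remove the span tags from the data so that values do not show up twice in columns such as Player, Date announced, and Penalty?
I understand that this happens because the table cells contain span tags, and the code above concatenates the text inside the span tags with the rest of the cell text.
I tried the following: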
removeNodes(getNodeSet(xmlTreeParse(wiki_url_html, useInternalNodes = T), "//table/tr/th/i/span"))
following this Stack Overflow post:
Scraping a complex HTML table into a data.frame in R
in order to remove the span tags, but it just returns NULL.
Any help would be greatly appreciated, thanks.
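For illustration, here is a minimal sketch of the kind of result I am after. It assumes the duplicated values come from hidden sort-key spans inside each cell (here marked with style="display:none"); the real page may mark them differently, for example with a class, so the XPath would need adjusting after inspecting the page source. The inline HTML is a made-up miniature of the table, not the actual Wikipedia markup. It uses xml_find_all() and xml_remove() from xml2 to drop the spans before calling html_table().

```r
library(xml2)
library(rvest)

# Hypothetical miniature of the table: each cell holds a hidden
# sort-key span followed by the visible text (an assumption about the markup).
html <- '<table>
  <tr><th>Player</th><th>Penalty</th></tr>
  <tr><td><span style="display:none">Morse, Mike</span>Mike Morse</td><td><span style="display:none">010 !</span>10 days</td></tr>
</table>'

page <- read_html(html)

# Find the hidden spans and remove them from the parsed document in place.
hidden <- xml_find_all(page, "//table//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# Convert the cleaned table; only the visible text remains in each cell.
tbl <- html_table(html_node(page, "table"))
print(tbl)
```

Applied to the real page, the same two xml2 calls would run on wiki_url_html before the html_nodes() and html_table() step, with the XPath changed to match however the spans are actually marked there.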